(57) Abstract

A modular neural ring (MNR) system is provided for neural network processing which comprises one or more primitive
rings (36, 38 and 40) embedded in a global communication structure (GCS) (44). The MNR bus (44) is a multiple access,
multi-master, arbitration, hand-shaked data bus. Each primitive ring is a single instruction stream, multiple data stream
(SIMD) machine having a control unit for controlling in parallel a number of processing elements (PEs) (54, 56, 58)
connected by a local communication network (64). Within a primitive ring, a master controller (86) controls housekeeping
functions, scratch pad memory and synchronization. A processor controller (88) transmits signals to all of the PEs on the
primitive ring to carry out vector processing. An interface controller (90) controls the primitive ring's external
interfaces to the GCS (44). Computation within a processing element is performed by processor logic blocks (PLB) (150).
Each PLB implements a RAM based shift register scheme.
TITLE OF THE INVENTION:

MODULAR PARALLEL PROCESSING SYSTEM 

BACKGROUND OF THE INVENTION: 

The architecture and nature of neural network processing machines promise a solution
to certain kinds of problems that are too slow, if not impossible, to solve even on the most
powerful currently available computers. When research interest expanded to artificial
intelligence, researchers realized that only limited progress can be made with current
computing technologies. Computers are currently limited by their serial Von Neumann-type
of architecture, and because they are essentially discrete symbol-processing machines.

Human beings are not purely logical, nor can human behavior be regulated by
mathematical or logical formulas. Human beings do not make decisions based on evaluating
several hypotheses through formal probabilistic methodology, nor do human beings go step
by step through any existing pattern-recognition algorithm to recognize objects. Observations
of human behavior indicate that it is very difficult to achieve a state of discerning intelligence
without an inductive processing tool.

According to the definition given by the Defense Advanced Research Projects Agency
(DARPA), "The Neural Network is an information processing system which operates on
inputs to extract information, and produces outputs corresponding to the extracted information
... Specifically, a neural network is a system composed of many simple processors - fully,
locally, or sparsely connected - whose function is determined by their interconnection topology
and strengths. The system is capable of a high-level function, such as adaptation or learning
with or without supervision, as well as lower-level functions, such as vision and speech
pre-processing. The function of the simple processor and the structure of the connections are
inspired by biological nervous systems".

The key attributes of neural network functions are massive parallelism and adaptivity.
The massive parallelism results in high-speed performance and in potential fault tolerance.
Adaptivity means that the neural networks can be trained rather than programmed, and their
performance may improve with experience. Another important advantage of neural networks
is that they enjoy parallel processing while remaining simple to use. From an information
processing viewpoint, neural networks are pattern-processing machines whose function is akin
to the inductive inference associated with human brain functions. This is unlike presently
available, deductive computers, which are based upon symbolic logic processing.

Figure 1 illustrates the type of biological neuron (20) which has influenced the
development of artificial neural networks. The synapse (22) is the tissue connecting neurons.
It is capable of changing a dendrite's (24) local potential strength in a positive or negative
direction, depending on the pulse it transmits. These transmissions occur in very large
numbers, but since they are chemical, they occur fairly slowly. The neuron is the processing
element in the brain, that is, it receives input, performs a function and produces an output.
The neuron has two states: firing or not firing. The synapse is basically composed of the
data-line connecting neurons, but is much more. A synapse may be inhibitory or excitatory.
When a neuron's output is connected through an inhibitory synapse to another neuron, the
firing of that neuron discourages the firing of the neuron to which the signal goes. On the
other hand, a neuron receiving an input through an excitatory synapse will be encouraged to
fire. Synapses may also have weights associated with them, which indicate the strength of
the connection between two neurons. If the firing of one neuron has a lot of influence on the
firing of another, the weight of the synapse connecting them will be strong. The human
cerebral cortex is comprised of approximately 100 billion (10^11) neurons, each having
roughly 1,000 dendrites that form some 100,000 billion (10^14) synapses. The system
functions at 10,000 billion (10^13) interconnections per second if it operates at about 100 Hz.
The brain weighs approximately three pounds, covers about 0.15 square meters, and is about
two millimeters thick. This capability is absolutely beyond anything that can be presently
constructed or modeled. Understanding how the brain performs information processing can
lead to a brain-like model, and possible implementation in hardware.

Artificial neural networks (ANNs) are inspired by the architecture of biological
nervous systems, which use many simple processing elements operating in parallel to obtain
high information processing rates. By copying some of the basic features of the brain into
a model, ANN models have been developed which imitate some of the abilities of the
brain, such as associative recall and recognition. ANNs are general-purpose pattern
processing machines, but there are specific classes of problems for which various types of
ANNs are best suited. Being parallel in nature, neural networks are best suited for
processing intrinsically parallel data-processing tasks. Thus, they are good for problems such
as image-processing and pattern-recognition, vision and speech-processing, associative recall,
etc. The characteristics that artificial neural networks hope to provide are:

1. Tolerance to removal of a small number of processing elements, 

2. Insensitivity to variations between processing elements, 

3. Primarily local connectivity and local learning rules, 

4. Real time response, and 

5. Parallelism. 

There are two major architectural approaches to the implementation of large scale
ANNs: (i) using a very high speed central processor; and (ii) implementing a fully parallel
processing system. The former is the typical ANN simulation on general purpose computers
and on neurocomputers. The latter is usually found in small-scale, special purpose devices.
Both of the architectural approaches suffer from most of the following limitations. As the
size of a neural network grows by a factor of N, the interconnections grow by a factor of
N^2. For fully parallel implementation, the capacity becomes an upper limit. For serial
central processing, the speed decreases as N increases. Some of the electronic virtual
neurocomputers implemented on a parallel architecture are: the Connection Machine, Warp,
AAP-2, and Transputer. However, these architectures have not efficiently solved the
connectivity problem, as needed for simulation of large-scale neural networks, which prevents
their expandability. Since these are multiple instruction-stream multiple data-stream (MIMD)
systems, controller operation expense tends to be disproportionately great with respect to
processing power. In other words, great expense contributes little to processing power.

A neuron in an ANN system is capable of accepting inputs from many neurons, and 
capable of broadcasting its activation value to many other neurons in the system through 
weighted interconnections. The ability to "memorize" and to "learn" in a neural network 
system is derived from the weighted interconnections. Every neuron should contribute its 
activation value to the state of the system. For an N neuron system, the potential fan-in and 
fan-out requirement is N'-/ in a fully connected model. This requirement increases 

Existing ANN simulators are either model-dependent or limited to selective models. As the
field progresses, new ANN models are constantly being developed. Researchers need an
implementation tool which would allow modular and reconfigurable realization of ANNs, in
order to help develop their ideas. Current ANN simulators are dedicated to those ANN
models they intend to simulate. One would need to redesign the system to meet one's special
needs. What is needed is a reconfigurable and modular architecture implementation of
ANNs, so that various topologies and sizes of ANNs may be realized efficiently.

What is needed for both the theoretical development and the commercialization of
ANNs is large-scale implementation architectures and technologies. Modular, massively
parallel architectures are the most promising in terms of scaling and extending neural system
capabilities. But the overwhelming problem associated with massive parallelism is efficient
communication. Massively parallel architecture machines such as the Connection Machine,
NCUBE, and Transputer, which are based on a hypercube communications topology, perform
quite well on certain models with local connectivity. However, the communication structure
of these machines can lead to dramatic decreases in performance for more general models.
Also, substantial portions of the hardware costs of these machines are
dedicated to control units.

What is desirable in neural network implementation is a universal modular 
architecture, which will allow VLSI hardware implementation of large scale neural networks. 
There are several requirements to the architecture: 

1. be highly parallel, 

2. allow for high hardware utilization, 

3. solve the network communications problem without fan in/out limitations,

4. be cost-effective in meeting performance goals with speed versus hardware
complexity trade-off,

5. be modular with easy interconnection facilities for the design of large systems
with varying requirements of connectivity and configuration,

6. be expandable with no added communication problems, while maintaining
connectivity, to allow operation at raised levels of complexity and neural
volumes,

7. be endowed with switchable connectivity so as to allow for dynamically
reconfigurable (by software) neural network system architectures, which may
implement different theoretical neural network models without
major redesigning or reconstruction, and be implementable with state of the
art fabrication techniques.

As stated previously, it is hoped that biological neural network computation principles
can be applied to artificial neural networks (ANNs) to help solve difficult problems of
recognition, association, optimization and other combinatorially complex problems. But
massively parallel computers have been developed and have still not attained human
performance in many difficult problem areas. Two key obstacles facing parallel computation
are problem decomposition (and representation) and communication. In order to use massive
numbers of processing elements on a problem, the problem must be parallelized and mapped
onto the processing elements in such a way that dependencies do not cause much of the
hardware to remain idle. Neural processing can be viewed as such a decomposition and, if
an understanding can be developed to allow difficult problems to be solved using neural
techniques, then one of the obstacles will be overcome. If a problem can be cast in neural
network terms, then we have a parallel decomposition of the problem.

But there still remains the communication problem. The difficulty with neural models
is that vast amounts of information must be communicated among neurons in the system.
Three important characteristics of biological neural networks make this possible. First, and
perhaps most important, is the three-dimensional connectivity structure of biological neural
networks contrasted with the planar connectivity of integrated electronics. Second, the
processing is much more distributed than suggested by the large number of neurons available.
In biological neural networks, each synapse (corresponding to a weight in the ANN) is a
processor which performs a multiplication as well as acts as a storage element. Accumulation
of a weighted sum is performed along the dendrites, thus making them also large distributed
processors (accumulators). In fact, since the processing is distributed and largely carried out
along the pathways between neurons (axons, synapses and dendrites), much of what might
be considered communication hardware is actually computation hardware. Finally, because
of the long processing times in each neuron, the data rates are relatively low, thereby allowing
less communication hardware. In fact, there is evidence that communication delays are
actually of computational importance in neural algorithms. Pulse coded activations and
regenerative conduction pathways contribute to robust operation in the presence of noise as
well as allowing long distance connectivity.

Most of the processing time used in an ANN is for accumulation of the weighted sum,
at least in the operation (also called retrieval) phase. The adaptation or learning phase also
requires computation, as will be discussed later. For a fully connected network of N
neurons, N^2 multiply-accumulates are required to compute all weighted sums for one network
cycle. The amount of computation involved in computing the activation function varies
depending on its complexity, but it scales linearly with N since the activation function is
applied pointwise to the sum vector. Furthermore, most of the hardware in an ANN system
is devoted to storing the weight matrix. For these reasons, ANN hardware speed is measured
in interconnections per second, and capacity is measured in total number of interconnects that
can be stored. In a 1988 DARPA neural network study, heavy use was made of a
performance plane with these two dimensions. These measures do not give a complete
picture, particularly for general purpose ANN hardware which is not designed for a specific
application. Additional characteristics which are desired, and found in the proposed
architecture, are summarized below.
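
The two performance measures described above can be illustrated with a short sketch. It assumes a fully connected network of N neurons, so that capacity is N^2 stored interconnects and one network cycle costs N^2 multiply-accumulates. The function names and the example figures are hypothetical and are not taken from the DARPA study.

```python
def capacity_interconnects(n_neurons: int) -> int:
    """Interconnects (weights) stored for a fully connected N-neuron network."""
    return n_neurons * n_neurons


def speed_interconnects_per_second(n_neurons: int, cycles_per_second: float) -> float:
    """One network cycle evaluates every interconnect once (N^2 MACs)."""
    return capacity_interconnects(n_neurons) * cycles_per_second


# Hypothetical example: 1,000 neurons updated 100 times per second.
print(capacity_interconnects(1000))                # 1000000
print(speed_interconnects_per_second(1000, 100.0)) # 100000000.0
```

A point on the performance plane is then simply the pair (capacity, speed) returned by these two functions.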

A general purpose implementation architecture must support a wide variety of models.
This includes support of various connection topologies, activation functions and learning
paradigms. The degree to which a particular implementation architecture can support a
variety of models is its flexibility. However, in many environments, this flexibility is
preferably carried one step further to programmability and dynamic reconfigurability. For
example, restructurable VLSI meets the criteria of flexibility, but the reconfiguration steps are
one-way, one-time static restructuring. An ANN workstation, for example, is preferably able
to be programmably reconfigured to suit the model under investigation.

The ability of the architecture to be scaled (in some set of dimensions) is its
extensibility. CPU based architectures are extensible in the size dimension but do not scale
well along the speed dimension. Modular extensibility allows the system to be scaled with
the addition of well defined modules and interfaces. Clearly, it is desirable to have modular
extensibility for large scale ANN implementation architectures. Ideally, modular extensibility
would include field upgradeable expansion of existing systems by the addition of standard
modules.

Although speed, capacity, flexibility and modular extensibility are important, desirable
properties, they all incur some costs. These costs are ideally minimized in a good
implementation architecture. Thus, efficiency is an important property in an implementation.
Efficiency is viewed both in terms of implementation efficiency as well as operational
utilization.

An architecture should allow trade-offs to be made between the properties listed
above. Trade-offs are possible in three major regimes. First, the system designer can make
trade-offs during the design and construction of a particular instance of the system. Second,
modular field upgrades can be made, again making performance-cost trade-offs. Finally,
ANN applications can trade various parameters, such as precision against speed, during specific
model implementation.

Processing in ANNs can be divided into two distinct phases: the application phase
and the adaptation phase. The application phase is commonly referred to as the retrieval
phase in the literature, and this terminology originates from the use of ANNs as associative
memories. The application phase is the most consistent among ANN models. The processing
for each neuron in the application phase for the bulk of the models reported can be
represented by an activation function applied to a weighted sum of the inputs to that neuron.
Although some early models used linear activation functions, modern networks invariably
employ nonlinear functions such as steps, sigmoids and linear thresholds. Thus, the
processing for neuron i can be represented by

$$v_i(t+1) = F_i\left(\sum_{j \in J} w_{ij}\, v_j(t)\right)$$

where J is the set of neuron outputs that feed neuron i. Taken as a whole, the application
phase processing can be represented by a matrix-vector multiply followed by a pointwise
vector activation function.

$$V(t+1) = F\big(W\, V(t)\big)$$

The last equation describes a discrete time, synchronous system where all neuron outputs are
assumed to be updated at the same time step. Also, the weights are assumed not to be
functions of time during the application phase. The weights may vary with time if the
network is adaptive, but since this variation is performed in a separate phase, and because
the weight variations are usually slow, it is appropriate to represent the weights as fixed in
the application phase. The components of the vector function are, in general, different from
one another.
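
The application phase described above, a matrix-vector multiply followed by a pointwise activation, can be sketched as follows. This is a minimal illustration in plain Python, assuming a dense weight matrix stored as a list of rows and one activation function per neuron; the names `application_step`, `step` and `sigmoid` are chosen for the example only.

```python
import math


def step(x: float) -> float:
    """Hard-limiting (step) activation."""
    return 1.0 if x >= 0.0 else 0.0


def sigmoid(x: float) -> float:
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def application_step(weights, v, activations):
    """One synchronous update: each neuron i applies its own activation
    function to the weighted sum of the current outputs v."""
    sums = [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in weights]
    return [f(s) for f, s in zip(activations, sums)]


# Two neurons with fixed weights and step activations.
weights = [[0.0, 1.0],
           [-1.0, 0.0]]
v = [1.0, 0.5]
print(application_step(weights, v, [step, step]))  # [1.0, 0.0]
```

Passing a list of activation functions, rather than a single one, reflects the remark that the components of the vector function may differ from neuron to neuron.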

Not all ANN models perform such uniform processing of all inputs to a neuron. For
example, some of the neurons in ART treat inhibitory inputs differently from excitatory
inputs. Other models, such as the Neocognitron developed by Fukushima, combine sums
from different clusters of input neurons. This can be accommodated in the above formulation
by defining subneurons of each more complex neuron and then combining their outputs. The
sums proceed as indicated above, i.e., the combination is performed in the pointwise
activation function, which is now a function of several inputs. Alternatively, but equivalently,
inputs to a neuron can be classified and the activation function applied to the class sums.
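
The class-sum formulation can be sketched as below. The example assumes two input classes (excitatory and inhibitory), each accumulated as an ordinary weighted sum, with a multi-input activation combining the class sums. The `shunting` combination shown is a hypothetical choice for illustration and is not the specific rule of ART or the Neocognitron.

```python
def weighted_sum(pairs):
    """Accumulate a weighted sum over (weight, input) pairs."""
    return sum(w * x for w, x in pairs)


def class_sum_neuron(excitatory, inhibitory, combine):
    """Compute one class sum per input class, then apply an activation
    function that takes the class sums as separate arguments."""
    return combine(weighted_sum(excitatory), weighted_sum(inhibitory))


def shunting(e: float, h: float) -> float:
    """Hypothetical combination: excitation divided down by inhibition."""
    return e / (1.0 + h)


excitatory = [(1.0, 2.0), (0.5, 2.0)]  # (weight, input) pairs, sum = 3.0
inhibitory = [(1.0, 1.0)]              # sum = 1.0
print(class_sum_neuron(excitatory, inhibitory, shunting))  # 1.5
```

Each class sum is the same multiply-accumulate operation as before, so only the final pointwise function changes.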

The adaptation phase varies considerably among ANN models. It is difficult to express
the general adaptation phase processing as succinctly as the application phase. However, the
following equations can represent the adaptation processing for most of the mainstream ANN
models.

Determination of the Δw varies considerably among ANNs, but the following
functional form is sufficiently general to include most ANN models.

In this equation, l indexes the layer which contains the neuron receiving stimulation
via the weight. The h terms are some local function of the state of the corresponding neuron.
This is usually that neuron's output or accumulator value. In some cases (such as with
supervised learning like Backpropagation) h may include target pattern information local to
an output neuron. The S terms are summary terms for the corresponding layer. The
generalized delta learning rule uses this to back propagate error gradient information. These
can also be used in reinforcement learning as nonspecific response grading inputs. The role
of r is the environmental or critic's input in reinforcement learning. Competitive learning
neighborhoods can be established by incorporating winner information in S. Learning rate,
momentum, plasticity decay and other such parameters are incorporated in the overall
function G.

The above formulation is sufficiently general to cover a wide variety of ANN models.
These include Perceptron learning, Widrow-Hoff, Backpropagation, Hopfield's outer product
construction, the linear pattern associator, Kohonen's self organizing feature maps, most
Hebbian and modified Hebbian learning, Oja's principal component extractor, vector
quantization and adaptive resonance networks. Because the architecture is programmable with
a fairly capable instruction set and much flexibility, it is possible to implement algorithms that
are not represented by existing equations. Since ANN models are constantly being
introduced and improved, this flexibility is essential.
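
Two of the learning rules named above can be written as weight increments computed from purely local quantities. The sketch below uses their common textbook forms (plain Hebbian learning and the Widrow-Hoff least-mean-squares rule); it is an illustration only and is not the general functional form of this description. The function names and the learning rate `eta` are assumptions of the example.

```python
def hebbian_delta(eta, pre, post):
    """Textbook Hebbian increment: delta_w[i][j] = eta * post[i] * pre[j]."""
    return [[eta * y * x for x in pre] for y in post]


def widrow_hoff_delta(eta, pre, post, target):
    """Textbook Widrow-Hoff (LMS) increment:
    delta_w[i][j] = eta * (target[i] - post[i]) * pre[j]."""
    return [[eta * (d - y) * x for x in pre] for d, y in zip(target, post)]


pre = [1.0, 2.0]  # outputs of the neurons feeding the weights
print(hebbian_delta(0.5, pre, post=[2.0]))                      # [[1.0, 2.0]]
print(widrow_hoff_delta(0.5, pre, post=[0.0], target=[1.0]))    # [[0.5, 1.0]]
```

In both cases the increment for a weight depends only on the state of the two neurons it connects plus, for the supervised rule, target information local to the receiving neuron.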

SUMMARY OF THE INVENTION:

A Modular Neural Ring (MNR) architecture is described below in accordance with
the present invention. The MNR architecture is a collection of primitive processing rings
(pRings) embedded in a global communication structure, which realizes the above described
requirements desired in a large scale implementation architecture for an ANN.

The essence of the MNR architecture is a collection of primitive processing rings
(pRings) embedded in a global communication structure. The pRings are SIMD machines
with a control unit serving in parallel a large number of attached processing elements (PEs).
The PEs within a pRing are connected by a local ring communication network. Each pRing
executes its own control program, which synchronously and in parallel controls the attached
PEs. However, each pRing is potentially executing a different control program, thus the
processing nature of the overall MNR system is MSIMD.

The specific system architecture which was prototyped is a bussed pRing supported by
a host computer. Each pRing makes a connection to the system bus and to its left and right
neighbors. The connections to adjacent pRings allow for logically grouping a number of
pRings to form a larger processing ring (called a slab). The bus is provided for more
arbitrary communication between slabs.

The major computation in an ANN is the matrix-vector multiply. But aside from
comprising the bulk of the processing, this operation also comprises the greater part of the
communication requirements. For now, attention will be focussed on a fully connected
network (e.g. a Hopfield net), but more general connectivity will be discussed in the next
section. In this type of network, each neuron (or PE) requires the activation level of all other
neurons in order to compute its weighted sum. If one neuron is assigned to each PE then N
multiply-accumulates are required in each PE to complete the weighted sum phase of the
processing. Each weighted sum is computed sequentially within the corresponding PE and
thus each PE requires only the activation level of one neuron at a time. If the processing is
properly phased within each PE then the activation levels can be placed on a ring and
circulated such that they arrive at each PE at just the right time.

DETAILED DESCRIPTION OF THE DRAWINGS: 

These and other features and advantages of the present invention will be more readily 
apprehended from the detailed description when read in connection with the appended 
drawings, in which: 
Fig. 1 illustrates a biological neuron;
Fig. 2 is a schematic diagram of a modular neural ring (MNR) architecture;
Fig. 3 is a schematic diagram of a modular neural ring (MNR) architecture configured as a
single in accordance with the present invention;
Fig. 4 illustrates a bussed pRing architecture constructed in accordance with the present
invention;
Fig. 5 is a schematic diagram of a primitive ring (pRing) of processing elements (PEs)
constructed in accordance with the present invention;
Fig. 6 is a schematic diagram of three controllers provided with a pRing constructed in
accordance with the present invention, namely a master control unit (MCU), an
interface control unit (ICU) and a PE control unit (PCU);
Fig. 7 is a schematic diagram of an MCU constructed in accordance with the present
invention;
Fig. 8 illustrates a programmer's model of a pRing;
Fig. 9 is a graph illustrating speed and capacity characteristics of analog, fully parallel
architectures, serial central processor architectures and the pRing architecture of the
present invention;

Fig. 10 is a schematic diagram of an artificial neural network (ANN) workstation constructed
in accordance with the present invention;
Fig. 11 is a schematic representation of a virtual ring composed of several pRings in
accordance with the present invention;
Fig. 12 is a schematic diagram of a pRing constructed in accordance with the present
invention;
Fig. 13 is a schematic diagram of a PE string board constructed in accordance with the
present invention;
Fig. 14 is a schematic diagram of a processor logic block (PLB) provided within a PE string
board in accordance with the present invention;
Fig. 15 is a schematic diagram of a shift register simulation scheme constructed in
accordance with the present invention;
Fig. 16 is a block diagram of a PCU constructed in accordance with the present invention;
Fig. 17 is a block diagram of an MCU constructed in accordance with the present invention;
Fig. 18 is a block diagram of an ICU constructed in accordance with the present invention;
Fig. 19 illustrates bus transmission timing on an MNR bus constructed in accordance with
the present invention;
Fig. 20 is a state diagram of bus transmission protocol in accordance with the present
invention;
Fig. 21 illustrates the data processing hierarchy at which the MNR architecture of the present
invention is programmed;
Fig. 22 is a schematic diagram of the MNR language hierarchy in accordance with the
present invention;
Fig. 23 is a schematic diagram of subprocess relationships associated with SimM, a
simulation tool developed in accordance with the present invention for simulation of
the MNR architecture;
Fig. 24 illustrates a construction phase of SimM;
Fig. 25 illustrates a simulation phase of SimM;
Fig. 26 illustrates global control and program loader modules of SimM developed in
accordance with the present invention;
Fig. 27 illustrates a pRing module developed for use with SimM;
Fig. 28 illustrates a HOST module developed for use with SimM;
Fig. 29 illustrates a global communication control module developed for use with SimM;
Fig. 30 illustrates a monitor module developed for use with SimM;
Fig. 31 is a graph illustrating the performance with MNR architecture of the present
invention on a DARPA;

Fig. 32 is a graph illustrating the effects of speed versus the number of PEs on the MNR
architecture of the present invention;
Fig. 33 is a graph illustrating the effects of speed versus the neuron PE ratio on the MNR
architecture of the present invention;
Fig. 34 is a graph illustrating the effects of speed versus the number of PEs on the MNR
architecture of the present invention;
Fig. 35 is a graph illustrating the effects of speed versus the neuron PE ratio on the MNR
architecture of the present invention;
Fig. 36 is a graph illustrating the effects of speed versus pRing size on the MNR architecture
of the present invention;
Fig. 37 is a graph illustrating PCU utilization versus pRing size of the MNR architecture;
Fig. 38 is a graph illustrating ICU utilization versus pRing size of the MNR architecture;
Fig. 39 is a graph illustrating speed versus communication bandwidth of the MNR
architecture;

Fig. 40 is a graph illustrating PCU utilization versus communication bandwidth of the MNR
architecture;
Fig. 41 is a graph illustrating ICU utilization versus communication bandwidth of the MNR
architecture;
Figs. 42 and 43 are graphs comparing device utilization of the PCU and ICU;
Fig. 44 is a graph illustrating speed versus communication bandwidth of the MNR
architecture;
Fig. 45 is a graph illustrating PCU utilization in the MNR architecture of the present
invention;
Fig. 46 is a graph illustrating ICU utilization in the MNR architecture of the present invention;
Fig. 47 is a graph illustrating speed versus the precision of the MNR architecture of the
present invention;

Fig. 48 is a graph illustrating PCU utilization versus the precision of the MNR architecture
of the present invention;
Fig. 49 is a graph illustrating ICU utilization versus the precision of the MNR architecture
of the present invention;
Fig. 50 is a graph illustrating performance characteristics of the MNR architecture of the
present invention;
Fig. 51 is a graph illustrating cost and performance estimates of the MNR architecture of the
present invention;
Fig. 52 is a graph illustrating the performance of a two pRing MNR prototype;
Fig. 53 is a graph illustrating the performance of a forty pRing MNR prototype;
Fig. 54 is a graph illustrating the performance of a multi-layered feed forward MNR
prototype;
Fig. 55 is a graph illustrating the performance of an error back propagation (BP) MNR
prototype; and
Fig. 56 is a graph illustrating the PE utilization of a BP implementation in an
MNR prototype.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS:

PRINCIPLES OF OPERATION OF THE MODULAR NEURAL RING (MNR) ARCHITECTURE

The basic processing ring of the present invention is configured as a synchronous ring
communication network with K processing elements (PEs) situated radially off the ring as
shown in Figure 2. Although the ring is the basic theme, variations are presented to suit
different ANN topologies. The simplest case to describe is a fully connected N neuron
network assigned to a ring of N PEs (K=N). It is also a simple matter to have K < N,
thereby allowing one PE to serve several virtual neurons. The PEs 30 operate synchronously
and in parallel on data delivered to them by a data vector, which circulates in the ring
communications network 20.

The architecture of the neural ring is basically a Single Instruction-stream Multiple
Data-stream (SIMD) processing structure with sequenced delivery of identical data sets to the
PEs 30, rather than partially processed data through each processor, as may happen in
pipelined processing of systolic architectures. The operations of the neural ring are highly
parallel, allowing for very high processing element utilization due to timely delivery of data
to the PEs. The ring communication structure allows for simple interconnection and
extensibility schemes, without major layout problems stemming from fan in/out or
connectivity requirements in VLSI implementations. The regularity of the neural network
topology also allows efficient replication of the rudimentary PEs, which are served in clusters
by the more complex control unit.

To understand the operation of the neural ring, consider the case of a fully connected
neural network with K PEs and N neurons, where K=N. A network cycle is completed when
the activation outputs of all N neurons have been updated. The cycle requires that the
weighted sum for every neuron's input in the system is accumulated, i.e., N^2 multiplications
and additions. These operations are carried out concurrently by the K PEs, where each
PE performs a multiplication and an addition or accumulation (MAC) in one primitive
computation time.

At the start of a network cycle, each PE places its neuron's value on the ring and
performs a MAC operation. During the MAC, the data are rotated clockwise on the ring,
an operation that is designed to take the same time as the MAC. Each PE then performs a
MAC with the new value on its ring input while the activation vector moves another step on
the ring. When the activation vector has made a complete tour of the network, all weighted
sums have been accumulated and the activation function is applied in each PE. The system
is now ready for another network cycle. Note that under the assumption that the ring step
time is equal to the processor MAC time there is no processor idle time. Synchronously, the
computed sums are mapped through the nonlinear activation functions of each neuron (e.g.,
hardlimiter, sigmoid, etc.) to produce the neuron outputs, and the next network cycle begins.
Since the data-circulation time is overlapped with the primitive computation time, there is no
PE idle time within the network cycle.
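
The network cycle described above can be sketched in a few lines of Python. This is a minimal
simulation only, assuming one neuron per PE (K = N); the function name ring_network_cycle, the
list-based ring and the rotation direction used to model "clockwise" are illustrative
assumptions and are not taken from the described hardware.

```python
def ring_network_cycle(weights, activations, activation_fn):
    """Simulate one network cycle on a ring of N PEs, one neuron per PE (K = N).

    weights[i][j] is the weight from neuron j to neuron i; row i models the
    horizontal slice of the weight matrix held by PE i.
    """
    n = len(activations)
    ring = list(activations)      # ring[i] is the value at PE i's ring input
    sums = [0.0] * n              # one accumulator per PE
    for step in range(n):
        # All PEs perform one MAC in parallel on the value at their ring input.
        for pe in range(n):
            src = (pe - step) % n  # neuron whose value has reached PE `pe`
            sums[pe] += weights[pe][src] * ring[pe]
        # Rotate the activation vector by one position. In the hardware this
        # step is overlapped with the MAC, so it adds no idle time.
        ring = [ring[-1]] + ring[:-1]
    # After a complete tour, every PE applies the activation function.
    return [activation_fn(s) for s in sums]


def hardlimiter(s):
    return 1 if s >= 0 else 0


W = [[1, -2, 0], [0, 1, 1], [2, 0, -1]]
a = [1, 2, 3]
outputs = ring_network_cycle(W, a, hardlimiter)
```

With an identity activation the result equals the ordinary matrix-vector product of the weight
matrix and the activation vector, which is a convenient check that every PE has seen every
activation exactly once after N ring steps.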

Each PE is preferably supported by a weight memory, which represents a horizontal
slice of the weight-matrix, and by an accumulator-memory which allows each physical PE
to serve multiple neurons. This provides a ready mechanism for cost versus speed trade-offs.
It is also flexible enough to accommodate various adaptation algorithms since each PE has
access to a slice of the weight-matrix. Different activation functions can be achieved using
polynomial approximations. State-dependent behavior can be implemented using the
associated accumulator-memory to store state information. The pointwise functions and
matrix-multiplication required for the learning operation can also be carried out in a similar
manner, although the logical connection topology for the learning phase of the network
operation is often different from that of the application phase.

Although there is great regularity in the connectivity and in the processing performed
by neurons in a network, for most models this regularity applies, for the most part, locally
to groups of neurons. Hence, the ring communication structure is most suitable locally. A
different global control and communication structure is usually required for the entire neural
network system. If the architecture is to be used to implement a variety of models, then a
dynamically reconfigurable global structure is needed.

The locality property of typical neural networks refers to the rule that if a neuron
connects to another neuron in the network, it is likely also to be connected to that neuron's
neighbors. Thus, if a neuron requires another neuron's value, it is likely to also require the
values of the other neurons in its neighborhood. A pRing (primitive Ring) is an indivisible,
ring-configured collection of processing elements. The idea is to assign groups of logical
neurons to pRings in such a way as to take advantage of local regularity, but still be able to
accommodate global specialization. In accordance with the present invention, the collection
of pRings characterizing the MNR architecture comprises a communication structure having
a common bus and bidirectional connections between adjacent pRings. Figure 3 shows an
MNR system of three consecutive pRings configured as a single neural ring. Connecting
adjacent pRings allows for several consecutive pRings to be configured as one large virtual
ring 42 which can represent, for example, a neural slab. The bus connection 44 allows for
communication between pRings that are not adjacent. A host computer 46 is connected to
the bus and serves as the data and command I/O processor.

The pRings 36, 38 and 40 are SIMD machines comprising central control units 48,
50 and 52, respectively, each of which serves a large number of peripherally attached PEs
operating in parallel. The PEs within a pRing are connected by a local ring-communication
network. Each pRing executes its own control program, which synchronously controls in
parallel the attached PEs. However, each pRing can potentially execute a different control
program from other pRings, thus making the processing nature of the overall MNR system MIMD.

An SIMD ring architecture is generally known. In accordance with the present
invention, an SIMD ring is modified to be primitive to allow for the employment of modular
primitive rings (pRings) as components in an ANN system.

With continued reference to Figure 3, the processing for all PEs within a ring is
preferably identical, with phased delivery of the data vector to all PEs in the ring. Such a
system is a prime candidate for SIMD/systolic processing. Thus, each PE can be kept quite
simple with the more complex control unit being used only once per ring. (In a general
purpose CPU, more circuitry is dedicated to control than to computation.) VLSI
implementation exploits the remarkable regularity of such a system to help minimize
development costs. The architecture can be implemented with a central controller which
broadcasts instructions to PE strings. PEs within a string are connected point-to-point in a
line. The strings are concatenated and the end strings connected together to form a ring.
With such an implementation, the system is modularly extensible and even field upgradable
by inserting additional PE strings. The additional PEs place no additional communication
burden on the system since each brings along its own point-to-point connection to its
neighboring PEs. Thus each PE requires only a pair of simple communication links.

With reference to the single ring architectures, there are some shortcomings. When 
the virtual topology is other than fully connected, the efficiency of the system can deteriorate. 
The system performance can also degrade if K is not compatible (in a sense to be defined 
later) with N. If the goal is a general purpose architecture on which to run various ANN 
models (an ANN workstation for example) then the ring may not be the best structure. 

In order to attain maximum PE utilization the processing load is preferably evenly 
distributed among the PEs. Also, the communication network should be able to deliver the 
data where and when it is needed. Since the ring is SIMD, then K must divide N. But, since 
this is unlikely for an arbitrary network, null neurons can be introduced to augment N. But 
the processing associated with the null neurons is equivalent to processor idle time since the 
model did not actually call for it. So an ANN model with 502 neurons implemented on a
ring of 100 PEs will require 98 null neurons resulting in a real utilization of about 84% (502/600).
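
The null-neuron arithmetic in the preceding paragraph can be checked with a short calculation.
The sketch below assumes that utilization is simply the ratio of real neurons to the padded
neuron count; the function name null_neuron_padding is illustrative.

```python
def null_neuron_padding(n_neurons, n_pes):
    """Return (null neurons needed, resulting utilization) for an SIMD ring.

    N is padded up to the next multiple of K so that every PE serves the
    same number of neurons; the padding does no useful work.
    """
    neurons_per_pe = -(-n_neurons // n_pes)   # ceiling division
    padded = neurons_per_pe * n_pes
    nulls = padded - n_neurons
    return nulls, n_neurons / padded


nulls, utilization = null_neuron_padding(502, 100)
```

For 502 neurons on 100 PEs this gives six neurons per PE, a padded size of 600 and 98 null
neurons, so 502 of every 600 MAC slots do useful work.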

The communication network may also degrade utilization by leaving PEs idle while they
wait to receive data. A locally connected network implemented on
a unidirectional ring will still require an entire network cycle to send the data to every PE
that requires it; however, for much of the time, the PEs will be receiving data that they
cannot use. The solution, as suggested above, is to use a bidirectional ring, but each ANN
topology will have its own requirement on the network and there are substantial costs
associated with general connectivity schemes.

Slab oriented ANNs, such as the Neocognitron and the Back Propagation Net, can 
compound the utilization problem. For such systems implemented on a single SIMD ring, 
a time multiplexed assignment of slabs onto the ring is required. Processing will proceed in
major phases with each major phase corresponding to a different slab. Now utilization losses 
due to null neuron augmentation will occur for each slab. Furthermore, the communication 
of data may be much more complex. 

The general structure of a bussed-pRing architecture is shown in Fig. 4. The
bussed-pRing architecture is flexible enough to support a wide spectrum of virtual topologies
efficiently. Each pRing 36, 38 and 40 makes a connection to the system bus 44 and to its
left and right adjacent pRings. The connections to adjacent pRings allow for logically
grouping a number of pRings to form a larger processing ring (called a slab). The bus is
provided for more selective communication between slabs.

The pRings are primitive vector processors which operate on data vectors delivered 
to them by a global communication infrastructure. Thus, the system can be viewed as a 
parallel processor whose components are parallel vector processors. The task of the 
global communication system is to transport data vectors to the vector processors. The 
complexity of the global communication system is then a function of the number of pRings 
(M) in the system rather than the number of PEs (K) in the system. The reduction in 
complexity is then some function of the number of PEs in a pRing (k) which can be 
several hundred. It is therefore possible to consider traditionally poor scaling topologies 
in terms of cost such as cross-bar and full interconnect, although modularity and 
extensibility may be sacrificed. 

If the number of logical slabs in the ANN model is small and the number of 
pRings in the MNR implementation is large, then each slab can be assigned to several 
pRings. In this case, the pRings can be connected in a string and contiguous pRings are 
assigned to a logical slab. This connection and assignment scenario is a recurring theme
in ANNs and suggests string connections between pRings.

In effect, several pRings strung together as such are viewed as a larger, virtual
neural ring if a path is provided between the end pRings in the string. However, the
location of this feedback path is dependent on the particular ANN model being
implemented. If all possible feedback paths were provided, the connection topology can
degenerate to full, point-to-point connectivity whose hardware complexity scales poorly
with M (order M^2) and is not easily extendable. There are other connection topologies
such as toroid, hypercube and many multistage networks, which scale much better, that
can be used.

In addition to providing feedback paths to close virtual neural rings, the global
communication structure preferably also provides for communication between slabs. This
communication task is more demanding since there is little additional communication
regularity that can be extracted among the various ANN models.

When analyzing candidate communication structures, two characteristics of the
communicating pRings are important to consider. They both stem from the fact that while
massive numbers of PEs are employed, they are already organized into moderately sized
groups, i.e., the pRings. This effectively reduces the communication problem from that
of interconnecting K processors to the smaller problem of interconnecting M=K/k
processors. Whereas realistic values of K can be in the 10^ to 10* range, M is more likely 
to be limited to the 10 to 100 range. It is no longer necessary to exclude high-complexity
topologies from consideration.
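
The effect of grouping PEs into pRings on interconnect cost can be illustrated numerically.
The sketch below counts the links of a full point-to-point interconnect, which grows as the
square of the number of nodes; the sizes used (2000 PEs in pRings of 200 PEs) are hypothetical
values chosen only for illustration, as are the function names.

```python
def full_interconnect_links(nodes):
    """Point-to-point links needed to connect every node to every other node."""
    return nodes * (nodes - 1) // 2


def prings_needed(total_pes, pes_per_pring):
    """M = K / k, rounded up when k does not divide K."""
    return -(-total_pes // pes_per_pring)


K = 2000                      # hypothetical total number of PEs
k = 200                       # hypothetical number of PEs per pRing
M = prings_needed(K, k)       # number of vector processors to interconnect

links_between_pes = full_interconnect_links(K)
links_between_prings = full_interconnect_links(M)
```

Under these assumed sizes, fully interconnecting the individual PEs would take 1,999,000 links,
whereas fully interconnecting the ten pRings takes 45.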

The second characteristic is that communication between pRings involves vectors of
size k. This reduces the importance of address overhead costs and improves the case for
circuit switched rather than packet switched networks. It also reduces the importance of
network diameter since the diameter affects the latency more than the throughput. Effects
of diameter on throughput are more indirect, via congestion. Since the communication is
overlapped with the processing, a small increase in communication bandwidth will
overcome the effects of latency. That is, once a trail is blazed from a source to a
destination pRing, the data vector flows at the full burst communication rate.
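
The claim that a small bandwidth increase absorbs the path set-up latency can be made concrete
with a simple model. The sketch below assumes that a vector transfer takes a fixed set-up
latency plus k elements at the burst rate, and that the transfer must finish within the time
the pRing spends processing the previous vector; the model, the function name and the numbers
are illustrative assumptions.

```python
def required_bandwidth(vector_len, processing_time, setup_latency):
    """Burst rate (elements per time unit) needed to hide a vector transfer
    of `vector_len` elements behind `processing_time`, given `setup_latency`."""
    if setup_latency >= processing_time:
        raise ValueError("latency cannot be hidden behind the processing time")
    return vector_len / (processing_time - setup_latency)


# Hypothetical case: a 200-element vector processed in 200 time units.
no_latency = required_bandwidth(200, 200.0, 0.0)
with_latency = required_bandwidth(200, 200.0, 10.0)
```

In this assumed case a set-up latency equal to 5% of the processing time is absorbed by raising
the burst rate from 1.0 to 200/190 elements per time unit.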

Although many communication structures can be used to augment the pRing string
structure, a bus is preferably used to exemplify an MNR architecture as depicted in Fig.
4. The bus, together with the string communication structure, provides the communication
infrastructure for the Bussed Modular Neural Ring Architecture (BMNR). The bus
hardware requirements scale linearly with M and pRings can be modularly added to
increase the system size without modifying each pRing (the communication degree of
each pRing is constant). The communication time is approximately constant except for the
small overhead for communication set up.

The MNR architectures contain a number of pRings, each of which is controlled
by its own, potentially unique, program. At the pRing level, processing proceeds in a
synchronous, SIMD fashion. At the system level, however, the pRings are more loosely
coupled, executing separate, asynchronous programs. Overall system synchronization is
attained partially by the pRing programmer and partially by asynchronous handshaking
between pRings. Thus, processing at the subvector level proceeds in systolic array
fashion while system level processing is more asynchronous in nature.

As discussed earlier, appropriate choices of SIMD array size (k) versus number of
vector processors (M), for ANN implementations with sufficient local processing and
communication regularity, can lead to flexible as well as efficient systems. Fully SIMD
systems suffer utilization loss due to fragmentation while fully MIMD systems incur large
control cost overheads.

The core of the MNR architecture is the pRing. A pRing block diagram is shown
in Fig. 5. As discussed above, a pRing consists of a number of primitive processing
elements (PEs) 54, 56, 58 and 62 served by a local ring communication structure 64 and
managed by a local centralized control unit (CU) 66. The CU is coupled to a control
memory 67. Each pRing is interfaced with the global communication system 68 via the
interface unit (I/F) 70. The interface unit is also supervised by the control unit 66. For
the bussed MNR architecture discussed above, the interface unit provides connections to
the left and right adjacent pRings and to the bus.

The ring communication structure within a pRing actually comprises three
concentric, independent rings: the P ring 72 for vector processing, the R ring 74 for
vector receiving and the T ring 76 for vector transmission. These rings are formed from
three registers 78, 80 and 82 at the entry to each PE. The processing ring, made up of
the P buffers associated with the PEs, is used to circulate an input vector to all of the PEs
on the ring. This circulation occurs at the PE processing rate. The R buffers are used as
an input vector receiver and staging area, so that when the PEs have finished using the
vector in the P ring they can immediately begin working on the received vector from the
R ring. Thus the communication of vectors between pRings can be completely overlapped
with the processing. The R buffers form a chain rather than a ring, since there is no need
for the received data to circulate within a receiver ring. The final set of buffers, the T
buffers, likewise forms a transmission chain rather than a ring. This chain is used to hold
a data vector for transmission to another pRing.
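
The interplay of the three buffer sets can be sketched as follows. This is a behavioral model
only, assuming list-based buffers, one element shifted per step and a particular shift
direction; the class and method names are illustrative and do not correspond to signals or
registers named in the description.

```python
class PRingBuffers:
    """P ring, R chain and T chain for a pRing of k PEs (illustrative model)."""

    def __init__(self, k):
        self.p = [0] * k   # processing ring: circulates the current input vector
        self.r = [0] * k   # receive chain: stages the next input vector
        self.t = [0] * k   # transmit chain: holds a vector bound for another pRing

    def circulate(self):
        """Advance the processing ring one step; the last element wraps around."""
        self.p = [self.p[-1]] + self.p[:-1]

    def receive(self, value):
        """Shift one incoming element into the receive chain (no wrap-around)."""
        self.r = [value] + self.r[:-1]

    def transmit(self):
        """Shift one element out of the transmit chain (no wrap-around)."""
        out = self.t[-1]
        self.t = [0] + self.t[:-1]
        return out

    def start_next_vector(self):
        """Hand the staged vector in the R chain to the P ring for processing."""
        self.p = list(self.r)


buffers = PRingBuffers(4)
buffers.p = [1, 2, 3, 4]
for element in [10, 11, 12, 13]:
    buffers.circulate()        # processing of the current vector ...
    buffers.receive(element)   # ... overlaps with reception of the next one
buffers.start_next_vector()
```

In this model the next vector is fully staged by the time the current vector has completed its
tour, so processing can continue without waiting on communication.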

The processing elements themselves are preferably configured to be simple. They
are all controlled in parallel by a central control unit within each pRing. Thus, only the
actual data manipulation parts of the arithmetic unit need to be replicated for each PE,
resulting in substantial hardware savings in implementation of a pRing. This also allows
a speed versus space trade-off between parallel and serial arithmetic units to be made
without incurring the control overhead costs associated with the more serial approaches to
each PE. Parallel arithmetic is the fastest and uses the most hardware, while the opposite
is true for bit serial arithmetic. Since the control overhead costs have been minimized by
the central controller, the time-space product remains fairly constant. Thus, the low speed
of the serial PE is compensated for by the large number of PEs which could be placed on
a chip. The advantage of parallel arithmetic is that the decrease in the number of PEs
leads to somewhat better utilization, due to a reduction in fragmentation.

The serial PE, on the other hand, lessens the communication bandwidth problem
by increasing the time spent processing an individual data vector. More importantly, with
bit serial arithmetic, a time and space versus precision trade-off can be made possible and
easy with programming. Thus, weight memory is bit-addressable and significant savings in
memory can be achieved. Further, lower precision weights can become adequate for a
given model. A similar argument holds for trading speed with precision. For large
models, the fragmentation issue is of less significance and the advantages offered by serial
arithmetic are substantial.
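
The time versus precision trade-off of bit serial arithmetic can be illustrated with a one-bit
full adder applied once per clock cycle. The sketch below is a generic bit-serial adder, not
the PE logic of the described system; the function name and the convention of returning the
cycle count are assumptions.

```python
def bit_serial_add(a, b, precision_bits):
    """Add two unsigned integers one bit per cycle, least significant bit first.

    Returns (sum modulo 2**precision_bits, cycles used). The cycle count
    equals the chosen precision, so lower precision is directly faster.
    """
    carry = 0
    result = 0
    for cycle in range(precision_bits):
        a_bit = (a >> cycle) & 1
        b_bit = (b >> cycle) & 1
        sum_bit = a_bit ^ b_bit ^ carry
        carry = (a_bit & b_bit) | (a_bit & carry) | (b_bit & carry)
        result |= sum_bit << cycle
    return result, precision_bits


total_8, cycles_8 = bit_serial_add(5, 9, 8)
total_4, cycles_4 = bit_serial_add(5, 9, 4)
```

Both calls produce the sum 14, but the 4-bit version takes half as many cycles and would need
half as many bits of operand storage.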

As shown in Fig. 5, pRing operation is controlled by the pRing control unit (CU)
66. The CU executes a program stored in the attached control memory (CM) 67. The
control unit actually comprises three individual controllers and a control memory 84 as
shown in Fig. 6.

The master control unit (MCU) is responsible for fetching instructions from the
control memory and either executing them directly or dispatching them to the other
control units. As will be described below, the MCU also contains a micro-processor and a
number of registers, which are used for a scratchpad and housekeeping for executing a
pRing program. Inter-pRing communication instructions are forwarded to the interface
control unit (ICU) which has the responsibility of controlling data vector transmission and
reception. PE instructions are dispatched to the PE control unit (PCU) which decodes
them and broadcasts control sequences to all of the attached PEs. The PE control unit can
execute "step and repeat" instructions which allow it to autonomously process an entire
vector circulation cycle. Instruction execution in each control unit is overlapped with
execution in the other two. The MCU, however, performs instruction fetching for all
three controllers. This configuration is not a bottleneck, because the ICU and PCU
instructions typically operate on entire vectors of data.

Fig. 7 is a block diagram of one embodiment of the MCU and attached control
memory. It is organized much like a general-purpose microprocessor with the PCU and
ICU operating essentially as attached co-processors. The registers 94 and ALU 96 shown
preferably do not participate directly in the vector data processing performed by the
pRing. Instead, they are used to control operation of the pRing program. Local
instructions are latched into the instruction register for the MCU and carried out by the
MCU. Instructions destined for the ICU or PCU are automatically sent to the appropriate
controller. Also, operands required for these instructions are automatically fetched by the
MCU and forwarded to the destination controller. Synchronization instructions are
provided to ensure that the ICU and PCU are not overrun with instructions.

As discussed above, the pRing is controlled by a program which resides in the
control memory, is fetched by the MCU and is executed on the MCU, PCU and ICU. An
instruction set, an instruction encoding format and an assembly language for the pRing are
provided in accordance with the present invention. The instruction set and machine code
definitions are provided in Appendix A. The programmer's model of the pRing is given in
Figure 8. In this representation, the programmer has direct access to the local MCU
registers 94, ALU 96 and control memory 84. The PEs are viewed as a processing
subsystem over which the programmer has control but no direct access to the data.
Similarly, the programmer can control the interface unit but cannot directly access it.

The pRing instruction set looks largely like that of a typical microprocessor. But in
addition to the usual instructions, special instructions for controlling the PCU and ICU are
also included. These instructions include step-and-repeat forms for the PCU and
vector-block forms for the ICU. Conditional transfer instructions can be used to test the
status of the other two control units and thus provide a level of programmed
synchronization between the units. The job of the pRing program is to choreograph vector
processing and system communication at the pRing level.

The instruction set is divided into three groups: local instructions, PE instructions
and interface instructions. This division is reflected by a two bit field in the opcode field
of the instruction format. The MCU monitors this field for passing each instruction to the
appropriate controller.
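
The routing of instructions by a two bit group field can be sketched as below. The actual
encoding is defined in Appendix A and is not reproduced here; the bit position of the field
(the top two bits of an 8-bit opcode) and the group values 00, 01 and 10 are purely
hypothetical, as is the function name.

```python
# Hypothetical group codes; the real values are defined in Appendix A.
GROUP_TO_CONTROLLER = {
    0b00: "MCU",   # local instructions, executed by the master control unit
    0b01: "PCU",   # PE instructions, dispatched to the PE control unit
    0b10: "ICU",   # interface instructions, forwarded to the interface control unit
}


def route_instruction(opcode, group_shift=6):
    """Return the controller that should receive `opcode`.

    `group_shift` is the assumed position of the two bit group field.
    """
    group = (opcode >> group_shift) & 0b11
    try:
        return GROUP_TO_CONTROLLER[group]
    except KeyError:
        raise ValueError(f"unassigned instruction group {group:02b}") from None


targets = [route_instruction(op) for op in (0b00000101, 0b01000101, 0b10111111)]
```

Because only two bits are examined, the MCU can forward PCU and ICU instructions without
decoding the rest of the opcode.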

There are five addressing modes, which are made available for most instructions.
These are immediate, direct, register, register indirect and memory indirect. When
fetching operands for ICU and PCU instructions, these operands are never used directly as
data. Instead they are used as instruction parameters such as weight memory offsets and
base addresses.
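
The five addressing modes can be illustrated with a small operand resolver. The sketch assumes
the conventional meaning of each mode as used in general-purpose microprocessors; the precise
semantics for the pRing are defined in Appendix A, and the function name, mode strings and
list-based register file and memory are illustrative.

```python
def resolve_operand(mode, operand, registers, memory):
    """Return the effective operand value under the conventional meaning of `mode`."""
    if mode == "immediate":
        return operand                       # the operand field is the value
    if mode == "direct":
        return memory[operand]               # the operand field is a memory address
    if mode == "register":
        return registers[operand]            # the operand field names a register
    if mode == "register_indirect":
        return memory[registers[operand]]    # the register holds a memory address
    if mode == "memory_indirect":
        return memory[memory[operand]]       # the memory word holds a memory address
    raise ValueError(f"unknown addressing mode: {mode}")


registers = [4, 0]
memory = [0, 0, 3, 99, 7]
weight_offset = resolve_operand("register_indirect", 0, registers, memory)
```

For a PCU or ICU instruction, a value resolved in this way would be forwarded as a parameter,
for example a weight memory offset, rather than used as data.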

Due to the diverse nature of the available ANN implementation architectures,
comparing the performance of one to another is very difficult. However, DARPA used
two generalized variables, referring to speed and capacity, as assessment criteria on the
graph depicted in Fig. 9. Capacity is measured by neuron-interconnections, indicating the
size of the ANN model which is implemented. Speed is measured by
interconnections-per-second, indicating the rate of primitive ANN operations (i.e.,
multiply-accumulate). On the speed/capacity plane, fully parallel architectures 104 reside
in the upper left corner. Serial CPUs 106 are positioned in the lower half of the plane
(Figure 9). Fully parallel architectures may perform at very high speed but are severely
limited in capacity. On the other hand, serial CPUs are limited in speed but they are quite
expandable in capacity (provided that there is adequate memory available). The MNR
architecture 108 is designed as a compromise in performance that fills the gap between fully
parallel implementation architectures and simulation on serial CPUs. Simulation results
described below substantiate that the performance of the MNR architecture resides in the
diagonal region between the fully parallel implementation architectures and the simulation
on commercial digital computers.



MNR ARCHITECTURE PROTOTYPE 

One of many possible MNR architecture prototypes, which has been designed and
constructed using off the shelf discrete integrated circuit components, will now be
described. A production modular pRing neural computer can also be designed and
implemented using custom or semi-custom VLSI components to reduce size and power
requirements and to increase performance. The production unit of a modular pRing
neural computer is somewhat different from what is described herein when VLSI
components are used. The required design deviations, however, do not affect
deleteriously the MNR architecture's capabilities.
A. Overall System Level Design 

The purpose of implementing the prototype described herein is threefold. First,
the prototype serves as a proof of concept model by demonstrating, from a real
engineering perspective, that the architecture is practical. The prototype also provides a
vehicle for further study of the problems associated with such architectures so that a
commercial version can reap the benefits of engineering decisions made on the basis of
data from a real machine. Finally, the prototype allows true performance measurements
to be taken with an accuracy that cannot be achieved through simulation and analysis.

The goal is not to produce the fastest, the most efficient, the most flexible or even
the most elegant design. The prototype is simply an evaluation and test platform for
architectures in the MNR family.

As such the prototype is of modest size and is implemented with mature (if not, in
some cases, old) technology. The prototype consists of five pRings with up to 40 PEs each
for a total of 200 PEs, which is much smaller than desired for a high performance
commercial product. The PEs perform low level operations at the rate of 200 ns per bit,
which is far from state of the art. Much of the design uses fairly low complexity
programmable logic arrays (PLAs). Thus the performance measurements taken from the
prototype are somewhat low. These numbers need to be scaled to bring them in line with
state of the art commercial technology and more respectable system size.

An ANN implementation workstation was implemented. In this system, the MNR
subsystem is connected as a coprocessor 112 to a general purpose host computer 114 as
shown in Fig. 10. The host computer 114, i.e., an 80386 class personal computer (PC),
serves as the user interface for the ANN workstation and hosts the ANN software
development tools. It is also responsible for downloading (to the MNR coprocessor)
initial weight values, pRing programs and data as well as supervising the overall operation
of the MNR system.

Within the MNR subsystem are a number of pRings 116, 118 and 120, each of
which is connected to its two immediate neighbors and connected to the MNR bus 122.
The adjacent pRing connections allow several pRings to be assigned to a larger virtual
ring and they support adjacent layer communication in certain layered models. This is
shown in Fig. 11. The bus connection is provided for more general inter-pRing
communication requirements and for closing the loop in larger neural rings constructed
from multiple adjacent pRings. The communication bandwidth of the bus, therefore, is a
function of the degree of local specialization and inter-slab communication and not a
function of the absolute ANN size. In fact, as it happens, the larger the ANN model, for
fixed MNR system and fixed ANN architecture, the lower the communication bandwidth
requirements.

pRings are interconnected via ribbon cables, and the host to MNR coprocessor
interface is preferably provided by a special purpose interface card (not shown). The
interface card allows the host computer to control the coprocessor operation by allowing the
host to become an MNR bus member.

The core of the prototype system, and for that matter, any MNR architecture, is the
pRing. Fig. 12 is a block diagram of a pRing 116 for the BMNR workstation. It consists
of three centralized control units and a number of primitive processing elements (PEs)
connected in a string. The string of PEs 128, 130, 132 and 136 emanates from and
terminates in the Interface Control Unit (ICU) 138 where the string can be closed to form a
ring 140.

Each PE is a relatively primitive processor (i.e., only an ALU) with associated weight
and accumulator memory as well as a stage of the communication string. These stages form
a collection of shift registers with the shift dimension being along the PE string. The shift
registers so formed are used for data vector reception, transmission and circulation for
processing. Reception, transmission and circulation operations can occur simultaneously and
independently so that communication and processing can be overlapped. Note that the shift
register file is situated outside the ALU so that data need not pass through the processing
circuits as is often done with systolic arrays.

The centralized controllers in the pRing are the Master Control Unit (MCU) 142, the
Processor Control Unit (PCU) 144 and the Interface Control Unit (ICU) 138. The MCU
controls the overall operation of the pRing and it is this unit that executes the pRing program.
The MCU controls the ICU and directs the instructions to the PCU.

The PCU (Processor Control Unit) 144 has direct control over the processing elements
and their attached weight and accumulator memories. It receives a stream of
macro-instructions from the MCU and decodes these instructions into control signal and
memory address sequences which it broadcasts, in parallel, to all PEs in the pRing. It also
directly controls the shift register file and provides arbitration and handshaking for the ICU
to access the shift registers for communication purposes.

The ICU (Interface Control Unit) is responsible for vector communication with other 
pRings. It controls data vector transmission and reception in the PE shift register file 
indirectly through the PCU. This is the unit most affected by the global communication 
structure. 

Photo 1 shows a prototype pRing used in the system. The PCU serves as the mother
board for the pRing, hosting up to five PE string boards oriented vertically on one end of the
PCU. The MCU and ICU can be combined into one PCB (printed circuit board), the MICU
(Master and Interface Control Unit), which is "piggy-backed" on top of, and parallel to, the
PCU.

B. The PE String Module

The PE string board contains eight processing elements (PEs) including their
associated accumulator and weight memories and shift register circuitry. Fig. 13 is a block
diagram of the PE string board. PE string boards are designed to be concatenated to form
arbitrarily long strings of PEs. As shown in Fig. 13, parallel address and control lines are
used for all PEs and one bit is used to cascade PE string boards to form larger chains.

The core of the PE string board is the array of eight Processor Logic Blocks (PLB)
150 that perform the computation. Each PLB is an ALU for one PE. Fig. 14 shows the
detail of a PLB. Logically, the PLB consists of a small number of one bit registers, a data
exchanger 152 and an ALU 154. The registers include flag registers C (carry) 156 and Z
(zero) 158 as well as three operand registers Y, Q and WD indicated by 160, 162 and 164
respectively. The data exchanger allows the data movement operations listed in Table 1. The
ALU implements the bit operations listed in Table 2.
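Table 2. Operation Code Table

CODE  OP2 OP1 OP0
 0     0   0   0
 1     0   0   1
 2     0   1   0
 3     0   1   1
 4     1   0   0
 5     1   0   1
 6     1   1   0
 7     1   1   1

LOAD PHASE (OP3*PHASE):
    WO, /Q*A+Q*WI, /Q*A, Q*A+/Q*WI, Q*A, Q*WI, WI, /Q

OPERATE PHASE (OP3*/PHASE):
    AD: Y^WO, Y^WO, Y^WO^C, Y^WO^C, Y^WO, Y^WO, Y^WO^C, Y^WO^C
    C:  0, 0, Y*WO+Y*C+WO*C, /Y*WO+/Y*C+WO*C, 0, 0

INIT (/OP3*/PHASE):
    C, Q, T, C, WI, Q, 0

[Table flattened in the source; entries are listed in source order and their row alignment is
only partly recoverable. "^" denotes exclusive-OR, read from damaged operator glyphs in the AD
entries.]
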

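Table 1. Exchange Code Table

Column headings, in the order extracted: CODE (OP2 OP1 OP0), /OP3, XD, AD, Y, WDO, /OP3, OP3

CODE   OP2 OP1 OP0   Entries
0      0   0   0     Y    AD   AD   SD   SD
1      0   0   1     Y    AD   SD   WDI  WDI
2      0   1   0     SD   AD   Y    AD   AD
3      0   1   1     SD   AD   Q    WDI  WDI
4      1   0   0     Y    WI   AD   Q    Q
5      1   0   1     Y    SD   SD   WDO  WDO
6      1   1   0     SD   WI   Y    Y    Y
7      1   1   1     SD   SD   Q    C    C

Remaining entries, in the order extracted: 0, 0, 0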



The PLB is actually implemented in a field programmable logic array (PLA) so its
physical structure is not as depicted in the figure. Also, certain limitations were imposed by
the limited number of product terms (7 or 8) available in each output logic macrocell.
However, the operations listed in the table are sufficient to implement the instruction set of 
a pRing. 

The pRing block diagram shown in Fig. 12 calls for at least three shift register stages
per PE, where a shift register stage can hold one maximum precision data element. The
maximum precision originally chosen was 16 bits. This would require 48 bits of shift register
for each PE. Implementation using discrete components would have required six chips (8 bit
shift registers) per PE for a total of 48 shift register chips on the PE string board. This is
excessive when compared to the only 11 chips used for everything else on the board.

Thus a shift register simulation scheme shown in Fig. 15 is used in accordance with
the invention which implements all of the shift registers using only four chips including a
memory 166 and a latch 168. The large chip count for the direct shift register
implementation is due to the off-the-shelf, discrete component constraint. In a VLSI
implementation, this would not be as serious a problem. However, the RAM based shift
register technique bestows other advantages on the design. First, the precision need no
longer be limited by the length of the shift registers since RAM is so economically fabricated
(compared to direct implementation shift registers). Next, there is no longer a requirement
for hard partitioning of the shift registers into the three types required. Instead, the shift
register memory forms a shift register file which can be partitioned in software into the
number and type of virtual shift registers needed. Finally, when implementing lower
precision operations (such as multiply) there is no need to waste clock cycles by clocking an
operand to the head of a register. With this implementation, the shift register file consists
of 256 one bit shift registers that can be partitioned into various sizes and numbers of word
shift registers. Operand widths of up to 64 bits are easily accommodated.
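The software partitioning of the shift register file can be illustrated with a small model. This is only a sketch under stated assumptions: the class name, the register names and the least-significant-bit-first ordering are hypothetical and are not taken from the specification. It models one PE's 256 bit memory and shows word registers of different widths sharing it.

```python
class ShiftRegisterFile:
    """Hypothetical model of one PE's 256 x 1 bit RAM used as virtual word registers."""

    SIZE = 256  # number of one bit locations, as stated in the text

    def __init__(self):
        self.bits = [0] * self.SIZE
        self.regs = {}  # register name -> (base bit address, width in bits)

    def partition(self, name, base, width):
        # Widths up to 64 bits are allowed; overlap between registers is not checked here.
        if not (1 <= width <= 64) or base < 0 or base + width > self.SIZE:
            raise ValueError("register does not fit in the shift register file")
        self.regs[name] = (base, width)

    def write_serial(self, name, value):
        base, width = self.regs[name]
        for i in range(width):  # one bit per address step (LSB first is an assumption)
            self.bits[base + i] = (value >> i) & 1

    def read_serial(self, name):
        base, width = self.regs[name]
        value = 0
        for i in range(width):
            value |= self.bits[base + i] << i
        return value


srf = ShiftRegisterFile()
srf.partition("send", 0, 16)   # a 16 bit register
srf.partition("acc", 16, 40)   # a 40 bit register in the same memory
srf.write_serial("send", 0xBEEF)
srf.write_serial("acc", (1 << 40) - 3)
```

Because the registers are only address ranges, changing the precision is a matter of choosing different base addresses and widths rather than rewiring hardware.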

The main disadvantage of this scheme is that more complex control circuitry is
required to implement the shift register simulation than to implement shift registers directly.
But this control circuitry is manifested as a one time cost in the PCU, rather than a recurring
cost for each PE, so even with only one PE string board, there is still a net savings. A small
penalty in speed is paid for this expedient, but this is not the critical path in the system so it
is affordable.

Each PE contains a one bit ALU whose instruction set repertoire is given in Tables
1 and 2. Word level operations are accomplished by multiple bit level operations as in any
bit serial ALU. The control for these concatenated control sequences is provided by the
PCU, thus the serial ALU control overheads, which are a one time cost in the PCU, are
amortized over all PEs in a pRing. Rather than using shift registers for operands and results as
is usually done in serial ALUs, the data memories 166 are, themselves, used as shift registers
by cycling the addresses appropriately. If a single operand memory were used instead of
separate weight and accumulator memories, two reads and a write would be required for each
bit operation, thus the memory is broken into a large weight memory and a small accumulator
memory. But this still creates a speed bottleneck at the accumulator memory which must
be accessed for one read and one write in most operations. For this reason, a small, fast
memory is chosen for the accumulator which can be cycled twice in the time required to cycle
the weight memory. The weight memory is necessarily the largest memory since the number
of weights dominates the size of the ANN.
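As an illustration of how a word level operation is built from one bit operations, the following sketch adds two operands bit serially. The carry update uses the majority expression Y*WO+Y*C+WO*C that appears in the operation code table. The sum expression (exclusive OR of the two operand bits and the carry) is an assumption, since the corresponding table entries are not legible, and the function names are hypothetical.

```python
def to_bits(value, width):
    """Return the bits of value, least significant first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Rebuild an integer from bits given least significant first."""
    return sum(b << i for i, b in enumerate(bits))


def bit_serial_add(y_bits, w_bits):
    """Add two equal-length operands one bit per step, as a one bit ALU would."""
    carry = 0  # the C flag register
    result = []
    for y, w in zip(y_bits, w_bits):
        result.append(y ^ w ^ carry)                 # assumed sum expression
        carry = (y & w) | (y & carry) | (w & carry)  # Y*WO + Y*C + WO*C
    return result, carry


sum_bits, carry_out = bit_serial_add(to_bits(200, 8), to_bits(100, 8))
```

With 8 bit operands, 200 + 100 wraps to 44 with a carry out of 1. The loop body is the only part a PE executes; the sequencing of the loop is the control overhead that the PCU supplies once for all PEs.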

The memories are bit addressable, so that maximum use can be made of available
memory resources. Models requiring less precision, then, benefit in both time and number
of connections available. The maximum precision in this implementation is 64 bits.

C. The Processor Control Unit (PCU) 

The Processor Control Unit (PCU) controls the processing elements (PE) in a pRing.
It is a single board with connectors on one end for the PE string daughter boards and an
interface to the MICU (Master and Interface Control Unit). The PCU can host up to five
PE string daughter boards for a total of 40 PEs. The function of the PCU is to emit
addresses and control signals to the attached PE strings in order to effect the processor
instruction set. The PCU requests macro instructions and parameters from the MCU and
carries these out by sending address and control sequences to the PEs. It also has direct
control over the PE string boards' shift register logic for transmission, reception and
circulation of data vectors. The ICU, however, is actually responsible for sending and
receiving data vectors from other pRings.

Fig. 16 shows the major logical blocks of the PCU. They include the clock and strobe
generation block 172, instruction memory and sequencer 174, the PE interface 176, the
MCU/ICU interface 178 and the parameter memory and accumulator 180. Although there is
some overlap in these units in the actual implementation, logically this is a good
decomposition.

The core of the PCU is the instruction memory and sequencer 174. The PCU is 
configured as a microcoded controller with a writable control store (WCS) 182. There is 
preferably no read only instruction memory so the controller must be boot strap loaded by 
the MCU. All PCU instructions are single, 40 bit words. Table 3 summarizes the instruction
word format. 



Table 3. PCU Instruction Word Format

BIT       GROUP0          GROUP1
0                 IMEDOE
1                 AASEL
2         ACS             PCS
3         PHASE           PRMOP0
4         PCLK            PRMOP1
5         AWE             PWE
6         INCWA           LDWA
7                 AOE
8..11     OP(0..3)        PAD(0..3)
12..14            SHOP(0..2)
15        GROUP0          GROUP1
16        INCAA           LDAA
17        WOP0            PINSEL0
18        WOP1            PINSEL1
19                NOT USED
20                NOT USED
21..23            PCOP(0..2)
24..39            IMED(0..15)

WOP(0..1)                       SHOP(0..2)
00                              100  Read Shift Memory
01  Read Weight Memory          101  Write Shift Memory
10  Write Weight Memory         110  Read&latch shift data
11                              111  Latch XD (exchg data)

PCOP(0..2)                      PCOP(0..2)
000  INC                        100  JPCY
001  JMP                        101
010  DJZ                        110  DECCTR
011  JPACK                      111  LDCTR

PINSEL                          PRMOP(0..1)
00  Counter Select (CTR2P)      00  HOLD
01  SUM Select (SUM2P)          01  LOAD ADDER VALUE
10  External Select (EXT2P)     10  COUNT
11  SUM2P&OUT PRQ (IMD15)       11  LOAD IMMEDIATE VALUE




Each word includes a 16 bit immediate data field and 24 additional bits for various control
functions. The control store accommodates 2048 forty bit instructions and is sequenced by
an 11 bit program counter (PC). The PC is
augmented by a loop counter and condition code selector. In addition to the default increment
of the PC, it can also be loaded from the immediate bus to implement conditional and
unconditional jumps. The jump conditions are PCU status information such as the loop
counter being zero, and they do not generally include status about data being processed in the
PEs. The immediate bus can be driven by either the instruction memory or by the parameter
memory under instruction control. Thus indirect jumping and subroutine linkage using
parameter memory locations is available. This also allows for dispatching microcode routines
from instruction addresses specified by the MCU. These addresses define the macro
instructions as seen by the MCU. It is important to note that the application programmer
need never program this complex unit. Instead, the PCU is used as a controller to give the
pRing its data processing instruction set.
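To make the 40 bit word layout concrete, the sketch below extracts a few fields using the bit positions listed in Table 3 (SHOP in bits 12 to 14, the group bit in bit 15, PCOP in bits 21 to 23, and the immediate field in bits 24 to 39). It assumes that bit 0 is the least significant bit of the word, which the text does not state, and the function and key names are hypothetical.

```python
def decode_pcu_word(word):
    """Split a 40 bit PCU instruction word into selected fields (bit 0 = LSB assumed)."""
    if not 0 <= word < (1 << 40):
        raise ValueError("PCU instruction words are 40 bits wide")

    def field(low_bit, width):
        return (word >> low_bit) & ((1 << width) - 1)

    return {
        "shop": field(12, 3),   # SHOP(0..2), shift memory operation
        "group": field(15, 1),  # selects the GROUP0 or GROUP1 meaning of shared bits
        "pcop": field(21, 3),   # PCOP(0..2), program counter operation
        "imed": field(24, 16),  # IMED(0..15), immediate data
    }


# Example word: immediate 0x1234, PCOP = 111 (LDCTR), group bit set, SHOP = 101
example = (0x1234 << 24) | (0b111 << 21) | (1 << 15) | (0b101 << 12)
fields = decode_pcu_word(example)
```

The 11 bit program counter matches the control store size, since 2 to the power 11 is 2048.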

A loop counter is provided for iterating instruction loops. The counter size was
chosen as six bits. This is the source of the 64 bit precision limitation of the pRing. Since
the PEs are implemented as serial ALUs using RAM instead of actual, fixed size shift
registers, they do not limit the precision. Greater than 64 bit precision could actually be
implemented by double loops in the PCU microcode.

The PE interface contains the address counters and control signal conditioning logic
to drive the PEs. An address counter is provided for the accumulator memory address bus
and another is provided for the weight memory address bus. The weight address counter can
be gated onto the accumulator address bus for instructions involving two accumulator
addresses. The on board accumulator latch/counter bus is used for the PCU to supply shift
register memory addresses. The PCU's on board accumulator is distinct from the
accumulators associated with the PEs.

The MCU/ICU interface block 178 includes logic for downloading the PCU
microcode, logic for receiving macro instructions and parameters, and logic for allowing the
ICU to access the PE send and receive shift registers. Access to the WCS 182 by the MCU
for microcode download is provided by a programmed I/O path from the MCU. A 16 bit
parameter latch is provided so that the MCU can deliver parameters and microinstruction
subroutine addresses to the PCU via parameter memory. A fully interlocked handshake
mechanism is provided for this interface using a parameter request (PRQ) signal emitted,
under program control, by the PCU, and parameter acknowledge returned by the MCU. This
mechanism allows for a microcoded instruction fetch routine to read an instruction into the
local parameter memory where it can be executed.

Access to the shift register memory on the PEs by the ICU is provided by a shift
arbiter and multiplexing logic. The PCU is given priority for shift register access and an
asynchronous handshake mechanism is implemented for the ICU interface. The PCU has
preemptive priority to the shift registers but because of the relatively infrequent access by the
PCU, this does not have much impact on communication of data vectors between pRings.

The parameter and accumulator block 180 comprises a 16 word by 16 bit read/write 
memory, a sixteen bit adder and a counter/latch for an accumulator. This block is used for 
PCU scratch pad as well as for receiving and storing instructions and parameters from the
MCU. The main purpose for the accumulator is to add user specified offsets to weight
addresses when executing various supervector instructions.

The PCU instruction word is very horizontally encoded. As such, many control 
strobes can be simultaneously asserted and several diverse microoperations can be 
accomplished in the same instruction cycle. The assembly language allows the expression of 
such instruction parallelism using an overlay syntax. An assembler that allows instruction
overlays and, to a small extent, performs instruction bit conflict detection has been
implemented as part of this work. However, the microcode for the PCU is complex and
generally requires an intimate knowledge of the PCU hardware to write. Fortunately,
however, the pRing programmer will not program the PCU at this level. Instead, microcode
subroutines for higher level macro instructions will be invoked. The PCU system
programmer needs to program the PCU in order to implement the PE macro instruction set
as seen by the pRing programmer.
D. The Master Control Unit (MCU) 

The MICU (Master/Interface Control Unit) 184 (Fig. 16) is the combination of the
MCU (Master Control Unit) and the ICU (Interface Control Unit) on a single PCB. This
board 184 is piggy-backed on top of the PCU (Processor Control Unit). Photo 1 shows how
the MICU fits in the system. The MCU and ICU were combined onto the same board for
physical packaging reasons; however, they are logically distinct units. As such the MICU
will be described by describing the MCU and ICU modules separately.

The MCU (Master Control Unit) is the central controller for the pRing and is depicted
in Fig. 17. It is preferably a standard 80186 microprocessor 188 design. The MCU has 64K
bytes of EPROM 190, 64K bytes of RAM 192, two RS232 asynchronous serial interfaces 194
and 196 and interfaces to the ICU and PCU 198 and 200. The EPROM contains MCU
bootstrap initialization code as well as an assembly level debugger (DBG86) and a remote
debugger kernel (Paradigm's TDREMOTE) for use with Borland's source level, remote
debugger. Until the availability of ANN level languages and compilers for the pRing
architectures, the MCU is the level at which the pRing is programmed. In order to minimize
the programming burden, the MCU is preferably implemented with the ubiquitous 8086
family processor and has been configured for programming in the C language. The resident
debugger and the remote debugger support tremendously reduce the software development
task.

A serial port 202 is connected to the development computer 204 and is used for 
downloading and executing code developed there. This can be done by using the resident 
DBG86 and a terminal program or by using the remote symbolic debugger. A switch selects 
which resident program executes after a reset. The other serial port was provided for 
individual pRing status displays, but, with the advent of a system level debug tool, it is now
seldom used. 

The MCU is responsible for providing an instruction stream of PCU macro
instructions during system operation. It also bootstrap loads the PCU microcode control
program into the PCU's WCS (Writable Control Store). The instruction stream interface is
a DMA (Direct Memory Access) channel between the PCU and the MCU. The asynchronous
PCU instruction and parameter interface is described in the earlier section on the PCU.
Although the interface supports programmed I/O, the DMA interface is much faster and
allows the MCU to attend to other overhead chores while instruction blocks are fetched by
the PCU directly from the MCU's memory.

The MCU to ICU interface is handled using a small set of I/O locations and two 
interrupts. Although the interface is implemented using programmed I/O, the ICU is 
sufficiently autonomous to limit MCU interaction to pRing sized subvectors. 
E. The Interface Control Unit (ICU) 

The other major subsystem on the MICU is the ICU (Interface Control Unit). This 
unit, depicted in Fig. 18, is responsible for pRing to pRing communication. It is initialized
and supervised by the MCU through the interface described above. The ICU interfaces to 
the PCU to obtain data for send operations and to deposit data during receive operations. 

The ICU to PCU interface protocol is preferably an asynchronous, fully interlocked
handshake. The PCU does not distinguish send from receive operations, i.e., it always shifts
the PE data by one bit from least to most significant bit. Recall that the shift registers reside
on the PE string boards but are controlled by the PCU. Thus, the ICU interfaces to the PCU
where arbitration and control for shift register access is implemented.

Send and receive operations are named from the point of view of the ICU. Thus, if 
the ICU is performing a receive operation, it supplies data to the PCU and ignores the value 
returned by the PCU. For send operations, the ICU uses the bit supplied by the PCU and
does not provide data on the PCU input bit. PE shift register addresses are provided by the
ICU. 

The primitive operation performed by the PCU for the ICU, then, is just shift one bit
(along the PE string dimension) of the 256 bit shift registers available. Thus the logical
organization of the shift register file bits into words is accomplished by the ICU. However,
the ICU itself imposes some constraints on the use of the shift register file. The ICU
organizes the shift register file into sixteen registers of up to sixteen bits each with lower
precision words justified to the low bits of the registers. This allows for some simplification
of the ICU hardware without an appreciable loss in flexibility. When the data vector
precision exceeds sixteen bits (up to 64 is supported), the data is sent in 16 bit chunks.
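The 16 bit chunking of higher precision data can be sketched as follows. The ordering of the chunks (low order chunk first) and the function names are assumptions made for illustration; the text only states that data wider than sixteen bits is sent in 16 bit chunks.

```python
def split_into_chunks(value, precision, chunk_bits=16):
    """Split a data element of the given precision into 16 bit chunks, low chunk first."""
    if not 1 <= precision <= 64:
        raise ValueError("precision of up to 64 bits is supported")
    value &= (1 << precision) - 1
    mask = (1 << chunk_bits) - 1
    return [(value >> low) & mask for low in range(0, precision, chunk_bits)]


def join_chunks(chunks, chunk_bits=16):
    """Reassemble chunks produced by split_into_chunks."""
    return sum(chunk << (i * chunk_bits) for i, chunk in enumerate(chunks))


chunks = split_into_chunks(0x123456789A, precision=40)
```

A 40 bit element therefore occupies three of the sixteen ICU registers, with the last chunk justified to the low bits of its register.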
There are four send destinations and the same four receive sources. Two of these are
the adjacent pRings in the string. Another source/destination is the MNR bus. The final port
is the MCU. The MCU data path is provided for bootstrap and debug purposes. The ICU
can send to, or receive from, a single device at a time. However, send and receive operations
can occur simultaneously with independent source and destination with different shift register
file addresses. Arbitration for use of the PCU interface between send and receive operations
is performed by the ICU on a first come first served basis with ties being granted to the last
operation that did not use the interface. In this manner, lockouts are avoided.
F. The MNR Bus 

With reference to Fig. 19 the MNR bus is preferably a multiple access, multi-master,
distributed arbitration, handshaked bus. It uses one clock line (BCLK) and one, open
collector data line (BUSIO). All pRings are synchronized to the common bus clock which
is provided from a single master source. Each pRing has its own local oscillator which
provides the pRing's ICU clock. Thus some synchronization between the bus subsystem and
the remainder of the ICU in each pRing is required. This is accomplished using a variety of
techniques (D2,D4).

The MNR bus is unique in its use of protocols for reducing the required number of 
signals. Using only a clock and one signal line (named BUSIO), the bus implements a 
multi-master, receiver addressed handshaked data transfer protocol with distributed arbitration
and lockout avoidance protocol. This is accomplished by using the one signal wire
differently in each phase of bus usage. Using this mechanism, no signal bandwidth is lost
as would be the case with several special purpose signals. For example, since communication
is performed with vectors between pRings, the bandwidth of separate address lines would be
wasted during the vector transfer after the receiver was identified. Similarly, bandwidth
would be wasted on separate signals for arbitration or data handshaking. The bandwidth of
these wires could be much more efficiently utilized by implementing multiple busses with the
marginal cost of only one wire per added bus (using the same global bus clock).

Transmissions on the MNR bus are from a single source (called the master) to a single
destination (called the receiver). A bus transmission consists of five phases which are IDLE,
START, ARBITRATION, RECEIVER ID and DATA as shown in Fig. 19. Fig. 20 is a
simplified state diagram which shows the protocol for using the bus.

The first phase 226, IDLE, is not really a transmission phase since it is simply the idle
state of the bus when no transmission is taking place. In this phase, BUSIO 218 (Fig. 19)
is high (that is, it is not pulled down by the open collector devices connected to it). At any
time during the IDLE phase, any pRing can pull BUSIO low to signal its intention to use the
bus. This is called the START 228 phase and lasts for exactly one bus clock period. Since
the bus is a single, open collector line, several pRings can, in fact, request use of the bus
simultaneously in this manner. This is sorted out in the ensuing ARBITRATION phase 230.
In the ARBITRATION phase, arbitration between potential senders (masters) is
performed. Only pRings that are ready to send at the start of this phase may compete for the
bus. Each pRing has a four bit bus address. Arbitration is accomplished by performing a 
distributed binary exchange search for the lowest addressed competing master. Thus pRings 
with lower addresses have higher priority for mastership of the bus. 

Contention resolution is performed in four bus clock cycles as follows. All potential
senders that are ready at the start of the ARBITRATION phase 230 will participate in the first
arbitration clock cycle by placing the most significant bit of their board address on the bus
by pulling BUSIO low if their address bit is 0, otherwise not driving the bus. At the end of
this clock cycle, pRings whose high address bit is different from the value on BUSIO drop
out of the competition. Because of the open collector wire ANDing on the bus, pRings with
a 0 in that address line remain on the bus. A pRing with a 1 in the high address bit remains
on the bus if no pRings with a 0 in that bit were competing. The surviving contenders from
the first round then perform the same sequence with the next most significant address bit.
This process continues, eliminating lower priority (higher addressed) senders at each stage
until, after the fourth clock, only one sender remains which is then the bus master.
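The four clock contention resolution described above can be modeled in a few lines. This is a behavioral sketch only: the function name is hypothetical, and the wired-AND behavior of the open collector line is modeled by taking the minimum of the bits driven by the remaining contenders.

```python
def arbitrate(addresses, address_bits=4):
    """Return the bus address that wins distributed arbitration on a wired-AND line."""
    contenders = set(addresses)
    if not contenders:
        raise ValueError("at least one sender must be competing")
    for bit in reversed(range(address_bits)):  # most significant address bit first
        # BUSIO is low (0) if any contender pulls it low, otherwise it floats high (1).
        busio = min((addr >> bit) & 1 for addr in contenders)
        # Contenders whose address bit differs from the line value drop out.
        contenders = {addr for addr in contenders if (addr >> bit) & 1 == busio}
    (winner,) = contenders  # distinct addresses leave exactly one survivor
    return winner


winner = arbitrate([9, 5, 12])
```

Here addresses 9 and 12 drop out in the first clock because address 5 pulls the line low with its most significant bit, so the lowest address wins, as the text states.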

The next phase, RECEIVER ID 232, is used for the sender to identify the receiver
by the receiver's board address. This address is placed on the bus in four consecutive clocks
starting with the most significant address bit. 

After the receiver has been identified, the DATA phase of the transmission occurs. 
This actually consists of the three sub phases RXREADY, DATA and TXREADY. In the
clock cycle immediately following RECEIVER ID, the receiver will indicate its readiness to
accept data by pulling BUSIO low. On this first RXREADY subphase 234, the receiver
should be ready or the transmission is aborted (with no data sent). If the receiver is ready,
the sender will place a bit of data on the bus in the next subphase (DATA). After this, the
sender indicates its readiness to transmit another bit by pulling BUSIO low (this is the
TXREADY phase 236). The sender has up to two bit times to become ready before the
transmission is aborted. After the TXREADY phase 236 (assuming the sender indicated
readiness to send) the RXREADY phase begins, now giving the receiver up to two clocks
(instead of one clock as on the first RXREADY phase) to indicate readiness to receive before
the transmission is aborted. Thus, a bit of data can take 3, 4 or 5 clocks to communicate in the
steady state. This process continues until either sender or receiver causes a transmission halt
by not indicating readiness in its respective phase. Usually, this will happen at the end of a
send block when the sending unit has no more data to transmit. However, it is possible for
activity in the PCU and/or ICU of either the sender or the receiver to cause interruptions in
the transmission. In these cases, the transmission simply starts again from where it was
interrupted. The sender's and the receiver's counters maintain the current position status for
the block.
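The figure of 3, 4 or 5 clocks per bit follows from adding one DATA clock to the TXREADY and RXREADY subphases, each of which may last one or two clocks in the steady state. The short sketch below enumerates the cases. It assumes that the "bit times" allowed to the sender are bus clocks, and the function name is hypothetical.

```python
def clocks_per_bit(txready_clocks, rxready_clocks):
    """Steady-state bus clocks needed to move one data bit."""
    for clocks in (txready_clocks, rxready_clocks):
        if clocks not in (1, 2):
            raise ValueError("a ready subphase lasts 1 or 2 clocks before abort")
    data_clocks = 1  # the DATA subphase itself
    return data_clocks + txready_clocks + rxready_clocks


possible = sorted({clocks_per_bit(tx, rx) for tx in (1, 2) for rx in (1, 2)})
```

The best case of 3 clocks occurs when both sender and receiver are ready on their first opportunity.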

After the DATA phase, the bus normally returns to the IDLE state 226 where
BUSIO is pulled high. If, however, another sender has been waiting to use the bus, a
separate idle cycle is not actually used. In this case, the START phase occurs immediately
after the busy indication (BUSIO high) that caused the termination of the last transmission.

The bus protocol also has a mechanism that prevents the high priority devices from 
locking out the lower priority ones. This might otherwise be a problem if a high priority
device is trying to make contact with an uninitialized receiver. The lockout avoidance
protocol is simply that any master having been granted the bus cannot compete for it again
until the bus has actually been idle for one clock. From the above description, an idle cycle
can only occur if no sender is waiting to use the bus. 

To better understand how this lockout avoidance protocol works, consider the case of 
senders 0, 2 and 4 all trying to send to busy receivers. On the first transmission, they all 
compete for use of the bus, and pRing 0 wins the arbitration since it has the lowest bus 
address. Upon finding a busy receiver, the transmission is terminated. Senders 2 and 4 are
both waiting to use the bus. Although sender 0 also needs to use the bus, it cannot compete
in the next cycle since the bus did not go to an idle cycle after the last transmission (senders
2 and 4 kept it from going idle). Now 2 and 4 compete and this time 2 is granted the bus.
Again, upon finding a busy receiver, the transmission is terminated. Now sender 4 is waiting
to use the bus and neither sender 0 nor sender 2 can compete for the bus because there have been
no idle cycles since either was granted the bus. Finally, 4 is granted the bus (there was no
competition) and the transmission aborts for lack of a ready receiver. Now, because all three
senders have been granted the bus with no intervening idle cycles, none of them may compete
for the bus. This causes a bus idle cycle which allows all three senders to, once again,
compete for the bus.
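The scenario with senders 0, 2 and 4 can be reproduced with a small simulation of the lockout avoidance rule. The sketch assumes, as in the example, that every receiver stays busy so each sender keeps retrying; the function name and the event representation are hypothetical.

```python
def grant_sequence(senders, steps):
    """Return the order of bus grants, with "IDLE" marking an idle bus cycle."""
    barred = set()  # masters that have been granted the bus since the last idle cycle
    events = []
    for _ in range(steps):
        eligible = [s for s in senders if s not in barred]
        if eligible:
            winner = min(eligible)  # the lowest bus address wins arbitration
            events.append(winner)
            barred.add(winner)      # may not compete again until the bus goes idle
        else:
            events.append("IDLE")   # nobody may compete, so the bus idles for a clock
            barred.clear()          # the idle cycle re-enables every sender
    return events


sequence = grant_sequence([0, 2, 4], steps=8)
```

Each sender is granted the bus once per round regardless of its priority, which is the property the protocol is intended to guarantee.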

Each pRing operates on its own internal clock and is, therefore, asynchronous to every
other pRing in the system (at the clock level). Thus, inter-pRing communication is
handshaked. Higher level synchronization is the responsibility of the pRing programmer by
advanced and careful scheduling of computation and communication operations. Variations
in pRing clock speeds usually result in implicit resynchronization at the subvector processing
level by blocking operations in the ICU. However, system start and stop operations as well
as synchronization in cases where the implicit ICU synchronization is inappropriate require
another global synchronization method. This is accomplished by two open collector party line
signals bussed to all pRings. Each pRing can pull the lines low and can read their state.
While the use of these signals is application dependent, the following example demonstrates
their use.

The two party line sync signals are independent and in this example only one is
required. First the signal is brought low by all pRings (this is the default condition achieved
after a reset). Then, as each pRing prepares to execute a cycle of computation (application
defined), it releases its hold on the sync line and waits for the line to go high. The line will
go high only when the last pRing has released it. After sufficient time for all pRings to
detect the high state of the line, each pRing brings the line low again and begins its cycle of
computation. When the pRing completes its cycle, it releases the sync line and waits for it to go
high, indicating that all pRings have completed their computation cycle.
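The behavior of one party line sync signal can be modeled as a wired-AND of the pRings' outputs: the line reads high only when no pRing is holding it low. The class below is a hypothetical sketch of that behavior and is not part of the described hardware or software.

```python
class SyncLine:
    """Model of one open collector party line shared by a number of pRings."""

    def __init__(self, ring_count):
        # After a reset every pRing pulls the line low.
        self.holding_low = [True] * ring_count

    def release(self, ring):
        self.holding_low[ring] = False

    def pull_low(self, ring):
        self.holding_low[ring] = True

    def is_high(self):
        # The line floats high only when the last pRing has released it.
        return not any(self.holding_low)


line = SyncLine(3)
states = [line.is_high()]
for ring in range(3):  # each pRing releases the line when it is ready
    line.release(ring)
    states.append(line.is_high())
```

The recorded states show the line staying low until the third and last pRing releases it, at which point every pRing can observe the high level and begin its computation cycle.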
DEVELOPMENT OF SUPPORTING SOFTWARE TOOL:
A. The Instruction Set and Programming Hierarchy

There is a hierarchy of levels at which the MNR architecture is programmed as shown 
in Fig. 21. At the top level, the ANN is decomposed and mapped 238 onto the pRing 
resources available. This process is currently manual and involves determining an efficient 
map from the virtual ANN processing and communication requirements to the pRings and 
communication scheduling. After a suitable decomposition has been determined, each pRing 
requires its own program. The pRing programs may be unique or several pRings may 
execute the same program depending on the degree of regularity in the ANN model. The 
pRing programs execute on the MCU and consist of a skeletal framework in which 
communication and vector data processing is scheduled. The pRing program running on the 
MCU issues communication commands 240 to the ICU and data processing macro-instructions 
242 to the PCU. It ensures synchronization between processing and communication by 
program constructs in the pRing program. Communication commands are carried out in ICU 
244 by hard wired state machines. The data processing macro-instruction stream sent to the 
PCU 246 is decoded and interpreted by a PCU micro-code program. This micro-code 
program causes the PCU to broadcast addresses and control signals to all attached PEs in a
pRing. The addresses are used to fetch operands and store results in weight, accumulator and 
shift register memory. The control signals are interpreted by the PLBs on the PE string 
boards and used to direct the operands through the ALU. 

Fig. 22 depicts the chain of the hierarchy that deals with data processing. Overall
control 248 is provided by the pRing program executing on the MCU. This program is
written in C. Blocks of macro-instructions to carry out various phases of processing are set
up in the MCU's memory. These blocks are created either in advance or during ANN
execution by a set of C macros which, when used in the program, have the appearance of an
assembly language program embedded in the C code. The macro-instruction blocks 250 are
sent to the PCU via a DMA channel. More accurately, the PCU fetches these instructions
via the DMA channel. When an instruction and its parameters have been fetched by the
PCU, the PCU executes a micro-code subroutine that implements it. During the execution
of the micro-code subroutine, the PCU can emit addresses 252 and control signals to the
attached PEs.

The PEs interpret the control signals as opcodes 254 for single bit operations which 
are carried out on addressed operands by the PLBs. Every PE on a pRing executes the 
opcode on data at identical addresses within each PE. There are a few conditional 
instructions that can use local data within a PE to modify the execution of some operations. 
This gives rudimentary data dependent and, using tag constants, PE position dependent 
capabilities. These capabilities were minimized, however, in favor of a more compact and 
efficient PE design. 

B. The PCU Microcode Assembler 

The PCU is a custom designed microcoded control unit. From the requirements of 
PCU programs, a mnemonic PCU micro instruction set is described in accordance with the 
invention. Because the instruction word is horizontal, many instructions can be executed
simultaneously. This limits the applicability of commercially available table driven
cross-assemblers. Thus, a PCU specific assembly language and overlaying assembler (called
CMDASM) are presented for development of the code. (See Appendix D)

The PCU instruction word is very horizontally encoded. As such, many control 
strobes can be simultaneously asserted and several diverse microoperations can be 
accomplished in the same instruction cycle. The assembly language allows the expression of 
such instruction parallelism using an overlay syntax. An assembler that allows instruction 
overlays and, to a small extent, performs instruction bit conflict detection has been 
implemented. However, the microcode for the PCU is complex and generally requires an
intimate knowledge of the PCU hardware to write. Fortunately, the pRing
programmer will not program the PCU at this level. Instead, microcode subroutines for
higher level macro instructions will be invoked. The PCU system programmer needs to
program the PCU in order to implement the PE macro instruction set as seen by the pRing 
programmer. The instruction set definition and bit patterns are defined in a header file so 
changes can be easily made if the need arises. 
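
As a rough illustration of instruction overlays and bit conflict detection, the following Python sketch merges several micro-operations into one horizontal word. It does not reproduce the CMDASM syntax or the actual PCU bit patterns; the field layout, the names STROBE_A, STROBE_B, SEQ_NEXT and SEQ_JUMP, and the function overlay are all assumptions made for the example.

```python
def overlay(*micro_ops):
    """Merge (mask, value) micro-operations into one horizontal instruction word.

    Two micro-operations conflict when they drive the same bit to different values.
    """
    word_mask, word_value = 0, 0
    for mask, value in micro_ops:
        shared = word_mask & mask
        if (word_value ^ value) & shared:
            raise ValueError("instruction bit conflict")
        word_mask |= mask
        word_value |= value & mask
    return word_value


# Hypothetical field layout: two independent strobes and a 2-bit sequencer field.
STROBE_A = (0b00000001, 0b00000001)
STROBE_B = (0b00000010, 0b00000010)
SEQ_NEXT = (0b00110000, 0b00010000)
SEQ_JUMP = (0b00110000, 0b00100000)

word = overlay(STROBE_A, STROBE_B, SEQ_NEXT)   # three micro-operations, one cycle
```

Overlaying SEQ_NEXT with SEQ_JUMP raises an error in this sketch, because both claim the same sequencer bits with different values.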

C. The MCU Control Language and Supporting Software Tools 

As stated previously, the MCU (master control unit) uses an 80186 microprocessor
188. Thus, code can be developed using readily available 8086 tools on the PC. For
example, Paradigm's Locate package, which modifies DOS EXE files for use in
embedded 8086 applications, can be used, as well as Paradigm's TDREMOTE (a remote debugging
kernel for Borland's source level debugger) to debug the MCU control program at source level
(C language). With the support of readily available terminal emulation programs, code for
the MCU can be downloaded, executed and debugged on the MCU. A small utility (called
TDX) communicates with TDREMOTE, downloads the MCU control program (a DOS EXE file)
and starts its execution without the need for Borland's source level debugger.

The code for the MCU is being developed mainly in C using Borland's Turbo C. A
set of C macros is under development to define the PCU instruction set for use within the C
programs (See Appendix E). These macros will allow the C programmer to view operations
on the PCU as instructions executed within the C program. They will also ease the task of
transporting ANN code written for the MNR simulation to code which will execute on the
prototype hardware.

Until the availability of ANN level languages and compilers for the pRing
architectures, the MCU is the level at which the pRing is programmed. The MCU has been
configured for programming in the C language. The resident debugger and the remote
debugger support tremendously reduce the software development task.

D. System Level Supporting Software

The system software from the level of the MCU (Master Control Unit) up to the host
comprises loaders and debuggers on both the MCU and the host computer as well as facilities
for developing ANN application programs.

An assembly language level debugger (DBG86) is used for downloading and
debugging MCU code from the host. Paradigm's Turbo remote debugger kernel (TDREM)
can also be used to run on the MCU. This kernel allows the use of Borland's Turbo
Debugger (TD) for source level symbolic debugging of C code on the MCU. An MCU
program downloader, TDX, has been developed for downloading the MCU control program.

A set of C language macros simplifies the pRing programmer's view of the PCU. In
addition, a small library of hardware dependent, low level primitive subroutines is available
for use by the pRing ANN application programmer (See Appendix F).

TD and TDREM allow extensive debug facilities on a single pRing. However, an
application in this architecture actually executes on several asynchronous pRings
simultaneously. An MNR system level debug tool which is more geared to application level
debug of multi-pRing programs can be used, as well as a master-slave protocol for system
synchronization, debug and control. In this scenario, one pRing is considered the master and
all others are slaves. Each pRing has a synchronization point where it resynchronizes to all
other pRings and can be directed, by the master, into a debug routine where the pRing state
can be examined or modified by the master using the MNR bus. The master pRing is
connected to the user's console and relays user commands and data to the slave pRings and
collects user requested data from the slave pRings. The master pRing could be replaced by
the host if desired, which would require a mechanism for the host to communicate on the
MNR bus. This would have the advantage of giving all of the resources of the host (most
notably the file system and extensive memory) to the master. It would also reduce the burden
on the pRing that would have been master. The slave code for this interface is anticipated to
be relatively small.

For the prototype system, an 80386-based PC is used as the host computer. The
reasons for this choice are the low cost, good performance and open architecture of this
machine. Also factored in are the wide-spread familiarity with this machine among potential
users and the readily available software development tools. The host computer is connected to
the MNR pRing coprocessor via a high speed interface. This configuration requires an added
circuit board in the PC and one in the MNR coprocessor. The coprocessor will include the high
speed interface card and a number of pRing boards, all connected via a backplane board. The
exact number of pRings will depend on space, power and cost constraints; however, the
system is designed to be expandable, so that the number of pRings is not a critical parameter
for the design. Eventually, bus loading constraints may keep the number of
pRing boards below a few dozen.

The system comprises an MNR system level debug tool designed so that the host can sit
on the MNR bus acting as a bus master with all of the pRings on the MNR bus as its slaves.
Each slave pRing has a synchronization point where it resynchronizes with all the other
pRings and with the bus master. The bus master is able to examine or modify the state of
each one of the pRings. This offers the advantage that the bus master (and so the MNR
system) can access the host computer's resources (most notably the file system and extensive
memory), while maintaining access to the resources of the MNR system.

E. MNR Simulation Software

As a first step in proving the engineering practicality of the MNR architecture, a 
simulation was written to test and refine it. The simulation was performed at the instruction 
set level and was parameterized for simulated speed.

1) Simulated MNR Architecture Model

The simulation consists of a number of simulated pRings embedded in a simulated 
global communication structure as described in the preceding sections. Architecture and 
topology files are used to specify the simulated architecture parameters and global 
communication topology. 

Each simulated pRing has an architecture like that of Fig. 12 except that the number
of ICU external interfaces is variable as a simulation parameter. The logical layout of the
SIMD central controller shown in Fig. 6 was retained in the simulation.

Fig. 7 is a block diagram of the simulated MCU including the pRing instruction and
scratch-pad memory. The control unit (CU) depicted indicates an instruction register (IR)
that holds the MCU instruction. There is a corresponding IR for the PCU and ICU. The
instruction memory program counter is maintained by the MCU with the CU appropriately
distributing instructions to the correct controller. Operands required for non-MCU instructions
are automatically fetched by the MCU and forwarded to the requesting controller. An
instruction destined for a busy controller will stop the fetch process until that controller
becomes ready. In this way, synchronization among control units can be easily accomplished
by the pRing programmer. One minor exception is that the ICU is decomposed into send and
receive subunits, each of which operates independently. So an ICU send in progress will
not cause fetch suspension if an ICU receive instruction is fetched. Additional instructions
are provided to check for busy conditions in the PCU and ICU (the MCU will never appear
busy to the pRing program) to allow for more efficient overlapping of control unit operation
in more complex situations.
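
The fetch suspension rule can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the simulator's code: the unit names "ICU_SEND" and "ICU_RECV", the per-instruction durations, and the one-clock issue cost are all hypothetical.

```python
def run(program):
    """Issue instructions in order, stalling the fetch while the target unit is busy.

    program is a list of (unit, duration) pairs. The MCU never appears busy;
    the ICU send and receive subunits are tracked independently.
    """
    clock = 0
    busy_until = {}                            # unit -> first clock at which it is free
    trace = []
    for unit, duration in program:
        if unit != "MCU":
            clock = max(clock, busy_until.get(unit, 0))   # fetch stalls here if busy
            busy_until[unit] = clock + duration
        trace.append((clock, unit))
        clock += 1                             # assumed: one clock to issue
    return trace


trace = run([
    ("ICU_SEND", 5),    # send starts and stays busy for 5 clocks
    ("ICU_RECV", 3),    # not stalled: receive subunit is independent of send
    ("PCU", 4),
    ("MCU", 1),
    ("PCU", 2),         # stalled until the first PCU instruction completes
])
```

In this model the second PCU instruction issues at clock 6 rather than clock 4, while the ICU receive issues immediately even though a send is still in progress.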

Fig. 8 is the programmer's model of a pRing. The SIMD control unit and the pRing
instruction set allow the pRing to look much like a typical, general purpose CPU with an
attached vector coprocessor and a communication channel processor. The main difference
is that the data processed and communicated by the attached vector coprocessor and
communication channel is not typically visible to the MCU. In fact, the MCU is provided
for directing the processing and communication for the PEs via the PCU and ICU. The local
data memory and MCU registers are used for housekeeping duties such as loop counting and
address generation for major computation cycles. Note that the PCU and ICU, themselves,
provide counters and address generators so that the MCU need only provide control at a much
higher level.

The pRing instruction set is divided into three categories - MCU instructions, PCU
instructions and ICU instructions. The MCU portion of the instruction set is similar to that
of a standard microprocessor including the ability to manipulate data memory and internal
registers. The PCU portion of the instruction set provides for vector processing in the
attached PEs. All of the PCU instructions have an additional level of indirection since they
supply address information that the PCU will use to control the PEs. The ICU instructions
are dispatched to the ICU subunits for sending and receiving vectors.

2) An Overview of The Simulation Program

Simulation architecture and topology files are used to specify the details of the
simulated MNR system. These include such things as the number of pRings (M), the number
of PEs per pRing (k), the speed of each pRing control unit, the speed of a PE, the topology
of the global communication network including the speed of the links, the PE arithmetic
precision and the amount of weight and accumulator memory in each PE.

The MNR topology file specifies the simulated physical connectivity of the MNR
system under study. The number and type of global communication ports for each pRing are
specified here. The communication ports may be either private, point-to-point connections
or shared bus connections. In either case, the interconnections among the pRings are also
specified. Consistency checks are performed during the simulation to ensure that no
communication conflicts are generated by the pRing programs. The speed of communication
is also parameterized to assess the impact of communication on processing throughput. With
this flexible simulation of the global communication infrastructure, various members of the
MNR architecture family can be simulated. Among these is the BMNR architecture.

Because simulating a massively parallel architecture on a CPU machine is inherently
slow, the simulation was parameterized to set the arithmetic precision rather than
accomplishing this with serial arithmetic routines. Thus, although the architecture described
has dynamically programmable arithmetic precision, the simulation actually implements this
with statically selectable precision and parameterized simulated arithmetic speed. The result
is that model performance under various arithmetic precisions can be observed and modeled
using a lumped speed model for arithmetic while the simulation can be accomplished with
relative efficiency.

The simulation is performed at the machine code level. The simulated pRings are
programmed in the assembly language which is compiled using a table driven cross assembler
configured for the pRing. The resulting object code is executed by the simulator.

In addition to serving as a measurement vehicle for the architecture, the simulation
turned out to be a useful debug tool for developing pRing programs. The user interface
included mechanisms for viewing weight and accumulator memory for every PE, registers
and memory for each pRing and communication status for the global communication system.
The simulation monitor also allowed breakpointing and single step capability at the system,
pRing and PE instruction level. Even supervector instructions could be executed one element
at a time. Many of these capabilities would require much additional circuitry in an actual
hardware implementation. Indeed, in the early stages of the hardware prototype development,
simulation runs were used to help verify the prototype operation.

The simulation programs and performance evaluation of the MNR architecture are
described in more detail below.

ADVANTAGES OF THIS MNR ARCHITECTURE

A. Expansion 

Expansion of the hardware which implements the MNR architecture, in order to
extend the scale of the ANN, can be accomplished in one of at least two ways. First, each
pRing can be augmented by inserting additional PEs into the primitive ring. Logically this
can be done without limits, because of the global ring communication structure; however, the
control unit fan-out should provide an upper limit. Note that this upper limit is quite high and
can, itself, be extended by signal amplification. Expanding the system with this method does
not cause a bus bottleneck, because even though the data packets sent over the bus are larger
(increased by, say, a factor of k), the time used to process each packet is also correspondingly
larger (increased by a factor of k).

The system may also be expanded by adding more pRings on the bus. The usual signal
loading problems are relevant here also and provide an upper limit for this type of expansion
as well. But bus utilization should only increase if the number of logically specialized slabs
increases, since it is mainly the inter-slab communication that requires the use of the bus.

Since the MNR architecture is modularly expandable, its expansion cost increases
linearly in terms of speed and capacity. By comparison, the fully parallel architecture is not
modularly expandable. The cost of modularly extensible serial CPU and MNR architectures
increases linearly. Extending fully parallel architectures results in exponentially increasing cost
beyond a technology-related point of VLSI-implementation density, because of massive
connectivity and packaging requirements.

From the speed point of view, fully parallel architectures are the architectures most
favored for small-to-medium size ANN models. But the expansion cost increases
exponentially with speed due to technology constraints. The serial CPUs rely heavily on
technology advances to gain speed improvements. The MNR architecture allows for modular
expansion at a linearly increasing cost without relying on technology advances. Notice
that any future technology advances will benefit the cost/performance figures of the MNR
architecture as well.

A more important issue in the development of ANN implementation architectures is
to incorporate the ability to improve performance as ANN-system needs and resources
change. Fully parallel architectures are etched in silicon during fabrication and cannot be
changed in the field of application. Serial CPUs can be upgraded in capacity but the system
speed is mainly fixed for a given machine. Since the MNR architecture can be extended
simply and modularly, e.g. by adding more PEs, and because each PE has local resources,
existing MNR systems can easily be extended in both capacity and speed.

The flexibility of an ANN implementation architecture can also be measured in terms
of the classes of models realizable on the architecture and in terms of the spectrum of cost
versus performance alternatives. Here again, the fully parallel architecture is very inflexible.
Serial CPUs allow the most flexibility in terms of model support, since the CPU has access
to all interconnections and neuron-values. However, the cost/performance ratio for serial-CPU
simulations is fixed. The MNR architecture allows great cost versus performance flexibility
both in design of new systems and in the upgrading of existing ones. The architecture is
slightly less universal than a serial CPU in terms of models supported. This is because of
the communication structure and locally SIMD nature of the pRing. However, neural
networks are based on a high degree of regularity in processing and communication. The
MNR architecture provides best support for models which exhibit this regularity. The
regularity of neural network models is a local property. Correspondingly, the SIMD
processing structure and the ring communication structure of the MNR architecture are local
properties. Thus, the architecture can support a wide variety of models, which are locally
regular but exhibit regional specificity.

The MNR architecture, with its SIMD processing and ring communication structure,
offers a regular, modular and expandable implementation of ANNs. The regularities in neural
network models also contribute to high hardware utilization in MNR implementations. The
global MIMD nature and the programmability of the pRings provide great flexibility to
support a wide variety of neural network models.

Since the architecture can be decomposed into VLSI implementable building blocks,
it is readily realizable using our most mature technology, and it need not wait for future
technological breakthroughs.

B. Variable Precision Processing Elements 

Various ANN models have diverse arithmetic precision requirements. For example,
the Asynchronous Binary Hopfield model requires only one bit activation values, while other
models (such as ART) are described with continuous activations. Weight precision
requirements are equally varied. For example, using the 0.15N stable memory estimate for
the Hopfield associative memory model, a 200 neuron network would require less than six
bit precision for weights. On the other hand, an adaptive Back Propagation network may need
several times that precision for complex error surfaces and small learning rates.
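
The six bit figure can be checked with a short calculation. The sketch below is one plausible reading, not a derivation given in the text: it assumes that the weights are formed by the usual outer-product (Hebbian) rule over bipolar patterns, so that each weight is an integer whose magnitude cannot exceed the number of stored patterns.

```python
import math

neurons = 200
patterns = 15 * neurons // 100        # 0.15N stable memory estimate
levels = 2 * patterns + 1             # integers from -patterns to +patterns (assumed rule)
bits = math.log2(levels)              # information needed to distinguish the levels
```

Under these assumptions the network stores 30 patterns, each weight takes one of at most 61 values, and the corresponding precision is just under six bits.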

These diverse precision requirements make it difficult to choose, a priori, an
appropriate storage and processing hardware precision. Most general purpose ANN execution
platforms (e.g. general purpose computer simulations, digital signal processor accelerators
and "neuro-computer" coprocessors) use a high precision (typically 24 to 80 bits) floating
point number format. This is sufficient precision for all ANNs likely to be run on these
systems, but it is tremendous overkill for many ANN models.

The solution is to provide arithmetic units that allow the programmer to dynamically
reconfigure the system for the precision required. There are two obvious ways to accomplish
this. First, atomic PEs could be provided, each with the capability of some minimal
precision arithmetic. For models requiring higher precision, strings of contiguous PEs could
be allocated as single, more powerful PEs. This "dynamic bit slicing" technique allows the
application programmer to trade the number of PEs against the speed of each PE. PE
expansion ratios would be limited to a factor of perhaps four or eight without the addition of
look ahead circuitry among grouped atomic PEs. Also, word serial multiplication is favored
over flash multiplication because of its amenability to this kind of modularization.

Restricting attention to binary integers, the time-space product for a multiplication
scales, approximately, as the product of the factor precisions. Ignoring control overhead
offsets, this remains true whether the operation is performed in parallel (a "flash" multiplier),
word serial or bit serial. The more parallel the multiplier, the more chip area that is taken
whereas the more serial the implementation the more time that is taken.
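
The scaling can be illustrated at the fully serial end of the range with a Python sketch of an unsigned shift-and-add multiply that uses a single one-bit adder. The function name, the cycle accounting (one cycle per one-bit addition, carry-out write and control overhead ignored) and the unsigned operands are assumptions for the example, not a description of the PLB circuitry.

```python
def bit_serial_multiply(a, b, a_bits, b_bits):
    """Unsigned shift-and-add multiply through a one-bit adder.

    Returns (product, cycles), counting one cycle per one-bit addition.
    """
    acc = [0] * (a_bits + b_bits)              # accumulator bits, least significant first
    cycles = 0
    for i in range(b_bits):                    # one pass per multiplier bit
        b_i = (b >> i) & 1
        carry = 0
        for j in range(a_bits):                # one adder step per multiplicand bit
            a_j = ((a >> j) & 1) & b_i         # partial-product bit
            total = acc[i + j] + a_j + carry
            acc[i + j] = total & 1
            carry = total >> 1
            cycles += 1
        acc[i + a_bits] = carry                # this bit is still zero before pass i
    product = sum(bit << n for n, bit in enumerate(acc))
    return product, cycles


p4, c4 = bit_serial_multiply(13, 11, 4, 4)
p8, c8 = bit_serial_multiply(255, 255, 8, 8)
```

With one adder the space is constant and the cycle count equals the product of the two precisions, so halving both precisions cuts the time by a factor of four.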

However, control overhead tends to dominate the hardware complexity of a bit serial
multiplier. But since the pRing is SIMD, most of the control can be added as a nonrecurring
cost (with respect to a PE) in the PCU. Furthermore, under the assumption of large ANN
models, the system can use as many PEs as can be provided, which is not usually the case
for smaller or less parallel problems.

Thus, for the prototype implementation, bit serial arithmetic was chosen. The system
performance (in interconnections per second) is about the same as parallel arithmetic for fixed
system cost and fixed precision. However, the fixed precision criterion is artificial since each
ANN will have its own precision requirement. With fixed precision, parallel arithmetic
circuits, ANN models with lower precision requirements will suffer a net decrease in effective
hardware utilization. Serial arithmetic allows another dimension of flexibility in the ANN
implementation. Now the ANN designer can trade speed against numeric precision. For
example, a 10 to 20 fold increase in speed is obtained over using fixed, 16 bit precision on
a 200 neuron Hopfield associative memory. Weight memory is also bit addressable, allowing
more efficient allocation of weight memory and often a net reduction in the amount of
memory required.

The "dynamic bit slicing" technique trades the number of PEs with the precision of
each PE while the serial processing technique trades the speed of each PE with precision.
The net effect is the same in either case - lower precision yields a correspondingly higher
system speed measured in interconnections per second. The latter approach gives a simpler,
lower cost and more flexible design. The advantage of the former approach is that, for
higher precision models, there are fewer PEs and, therefore, potentially less fragmentation.

As a final, practical implementation idea for variable precision PEs, there is a family
of RAM based gate arrays available from Xilinx. If the PEs were implemented using these
parts, the number and precision of the PEs could be programmed by reloading the gate array
architecture cells before execution of an ANN.

C. Super-Vector Instructions

Each pRing is an SIMD vector processor. The components of the vector are the PEs.
As such, every instruction broadcast to the PEs within a pRing is a vector instruction. A
single add instruction is multiplied by the number of PEs in the pRing (k). In addition, a
number of "Super-Vector" or extended instructions exist which perform vector operations on
each component of a vector. These instructions are essentially vector instructions with "step
and repeat" capability added.

For example, the MAC instruction (multiply-accumulate) will multiply a weight by
some value and add the result to some accumulator. This is k arithmetic operations (if
multiply-accumulate is considered a single operation) for the one MAC instruction. The
"super-vector" form of this instruction is XMAC (extended multiply-accumulate). This
instruction performs a MAC, then steps the input data vector in the ring by one PE position,
steps the weight address by a user specified offset value and performs the MAC again. This
process continues for up to k steps so that k^2 arithmetic operations are performed. This
instruction produces k inner products and is most often used to implement, in a single
instruction, a matrix-vector multiply for some submatrix of the weight matrix.

Other "super-vector" instructions are MACX, which is like XMAC but the
accumulation value is in motion, and XPRDCT, which forms the outer product of two vectors.
MACX is used when the weights are stored with the sending PEs rather than with the
destination PEs, as in the adaptation phase of Backpropagation. XPRDCT is used extensively
in the learning phase of various ANNs.
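
The step-and-repeat behavior of XMAC and XPRDCT can be sketched in Python as follows. The sketch is a model under stated assumptions rather than the instruction definitions: the direction of the ring shift, the skewed weight layout in which PE i keeps column (i + s) mod k at address s, and the function names rotate, skew, xmac and xprdct are choices made for the example.

```python
def rotate(ring):
    """Shift the circulating vector by one PE position (direction assumed)."""
    return ring[1:] + ring[:1]


def skew(matrix):
    """Assumed layout: PE i stores column (i + s) mod k of row i at address s."""
    k = len(matrix)
    return [[matrix[i][(i + s) % k] for s in range(k)] for i in range(k)]


def xmac(acc, weights, ring, base=0, offset=1):
    """k MAC steps with a ring shift after each one: k*k multiply-accumulates."""
    k = len(ring)
    addr = base
    for _ in range(k):
        for i in range(k):                     # all k PEs act at once in hardware
            acc[i] += weights[i][addr] * ring[i]
        ring = rotate(ring)                    # step the input vector by one PE
        addr += offset                         # step the weight address
    return acc


def xprdct(weights, local, ring, base=0, offset=1):
    """Accumulate the products local[i] * ring[j] into weight memory."""
    k = len(ring)
    addr = base
    for _ in range(k):
        for i in range(k):
            weights[i][addr] += local[i] * ring[i]
        ring = rotate(ring)
        addr += offset
    return weights


W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = xmac([0, 0, 0], skew(W), [1, 2, 3])             # one call, k inner products
outer = xprdct([[0, 0], [0, 0]], [1, 2], [3, 4])    # outer product in skewed layout
```

With this layout a single xmac call yields the matrix-vector product of W with the circulating vector, and xprdct writes its result in the same skewed arrangement that xmac reads.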

The time spent in processing an ANN is dominated by these types of operations.
Inclusion of these in the instruction set serves to speed the operations by eliminating the
overhead involved in instruction fetching and housekeeping by the centralized controller. It
also allows for a somewhat slower (or simpler) central controller (MCU), because higher
level housekeeping chores can be accomplished during the relatively long super-vector
instruction times.

SIMULATION OF THE MNR ARCHITECTURE 

The simulation and evaluation of the MNR architecture for implementation of ANNs
will now be discussed. The performance and trade-offs of the MNR architecture have been
tested. A powerful and extensible simulation tool is presented in accordance with the
invention.

SimM, for "Simulation of MNR", is a simulation tool to represent the MNR
architecture and its operation in software modules, and to provide an experimental
environment for study of architectural trade-offs and for investigation of ANN
model-dependent performance analysis. SimM is also used as an MNR architecture
development tool, to assist in communication conflict resolution, to debug pRing programs,
to analyze hardware utilization, and to experiment with various hardware configurations.
SimM can also serve as an ANN model development tool to confirm the theoretically
predicted performance of proposed new ANN models.

To accomplish the objectives stated above, SimM is capable of being reconfigured
through changes in parameters. These parameters include architectural parameters, topology
parameters and pRing control programs.

SimM also provides a set of on-line commands. One can operate SimM as if
operating a real hardware-implemented MNR machine. This tool allows the developer to
examine and to change the values in the accumulator-memory and in the weight memory, to
monitor the changing ANN states, and to check current utilization of devices. With SimM,
one can test a new ANN model on the MNR implementation system as well as test a new
MNR design for an existing ANN model.

Architectural parameters are used to define the physical architecture of the proposed 
MNR system. The available architectural parameters determine the flexibility of SimM as 
an MNR architectural simulator. The activities within SimM are regulated by the architectural
parameters as if they were regulated by the physical MNR architecture. The architectural 
parameters supported by SimM include: 

1. Number of pRings on system (M) 

2. Number of PEs per pRing (k = K/M)

3. Size of weight memory per PE 

4. Size of accumulator-memory per PE 

5. Number of communication ports for each pRing 

6. Channel bandwidth of communication links (B) 

7. Arithmetic precision of PEs (P)

8. Relative system clock cycle, in nanoseconds (ns)

9. pRing control unit (MCU) speed as a multiple of relative system clock 

10. PE speed as a multiple of relative system clock 

11. Interface control unit (ICU) speed as a multiple of relative system clock

The topology parameters define the global pRing communication topology of the
proposed MNR system. The MNR architecture involves two kinds of communication:
single-destination and multiple-destination communication. With single-destination
communication, the pRing communicates with the other pRing via the communication port
indicated by the pRing control program. The pRing does not know the real identity of the
corresponding pRing. The identities are defined by the topology parameter, which represents
the real hardware connection between two pRings. Multiple-destination communication, on
the other hand, involves one sending pRing and several receiving pRings. The sending pRing
is the bus master while sending data, and the receiving pRings are bus slaves.
Communication port 0 of each pRing is reserved for bus communication. The MNR
architecture can be configured as a multiple-bussed system or single-bussed system. SimM
can be extended to cover most possible communication topologies to meet future investigation
requirements.

The pRing control programs define the activities of pRings. A pRing assembly
language and corresponding machine code definition are included (Appendix A). The pRing
control programs that carry out these activities reside in the control memory of the pRings. SimM
can execute the control programs clock by clock, step by step or cycle by cycle as directed.
This eases the debugging process. SimM also allows sharing pRing control programs to
reduce the need for memory on the machine on which SimM is currently executing.

SimM can be divided into two disjoint processing phases: construction and
simulation. In the construction phase, the simulation program reads architectural parameters 
provided by the user to construct the target MNR system. The flexibility of MNR 
architecture is represented by the availability of architectural parameters supported by the 
simulation program. Currently, SimM supports eleven architectural parameters plus topology 
assignments and pRing control program assignments. 
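
For illustration, the eleven architectural parameters could be gathered into a single record that the construction phase validates before building the simulated system. The Python sketch below is hypothetical: the class ArchParams, its field names and the sample values are not taken from SimM, and the choice to store the total PE count K and derive k = K/M is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchParams:
    num_prings: int          # 1. number of pRings on the system (M)
    total_pes: int           # K; parameter 2 (k = K/M) is derived below
    weight_mem_size: int     # 3. weight memory per PE (units not specified here)
    acc_mem_size: int        # 4. accumulator-memory per PE
    ports_per_pring: int     # 5. communication ports for each pRing
    channel_bandwidth: int   # 6. channel bandwidth of communication links (B)
    precision_bits: int      # 7. arithmetic precision of PEs (P)
    clock_ns: float          # 8. relative system clock cycle, in nanoseconds
    mcu_speed: int           # 9. MCU speed as a multiple of the system clock
    pe_speed: int            # 10. PE speed as a multiple of the system clock
    icu_speed: int           # 11. ICU speed as a multiple of the system clock

    @property
    def pes_per_pring(self):
        """Return k = K/M, rejecting configurations that do not divide evenly."""
        if self.total_pes % self.num_prings:
            raise ValueError("K must be a multiple of M")
        return self.total_pes // self.num_prings


# Arbitrary sample values, chosen only to exercise the record.
params = ArchParams(num_prings=4, total_pes=64, weight_mem_size=1024,
                    acc_mem_size=256, ports_per_pring=2, channel_bandwidth=8,
                    precision_bits=8, clock_ns=50.0, mcu_speed=1, pe_speed=1,
                    icu_speed=2)
k = params.pes_per_pring
```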

The construction phase of SimM validates the MNR architecture design. In the 
simulation phase, the program simulates the activities of the MNR architecture, which, in 
turn, simulates ANN models. SimM executes pRing programs as the actual MNR system 
would do. Aside from providing ANN model results, SimM gives data such as elapsed 
simulation time and device utilization. Various factors can be derived from the data provided 
by the simulation program. The relationships among the subprocesses of the simulation 
program are shown in Fig. 23. 

In the construction phase, the Global Control is constructed for a proposed MNR system with the
architectural parameters given by the user. The Global Control instructs the PRING to
configure itself according to pRing-related architectural parameters. The PRING then
generates PEs along with their corresponding accumulator-memory and weight memory. The
PRING also creates a PCB (pRing Control Block) for every pRing. The Program Loader
loads the pRing control program for every pRing and then puts the starting address and length
of the pRing control program into the PCB.

The Global Communication Control uses communication-related architectural
parameters and topology parameters to construct the communication channels of the proposed
MNR system. Fig. 24 shows the construction phase of SimM.

The Monitor dominates the simulation process during the simulation phase. The user
controls simulation by using the Monitor. Performance and utilization data are produced by
the Monitor. Although there typically exists only a single copy of the PRING subprogram,
Monitor calls PRING with different PCBs to create the illusion of more than one pRing in
the system. For every clock period, Monitor preferably presents every PCB to PRING once.
That is, each pRing executes one clock cycle of its pRing control program. This can be
viewed as a time-sharing system in which every pRing has equal time slices. In this way,
Monitor simulates an MIMD MNR system on an SISD (Single Instruction-stream Single
Data-stream) machine. Fig. 25 shows the simulation phase of the simulation program.
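
The time-sharing scheme can be pictured with a minimal Python sketch in which one shared routine is called with a different PCB on every slice. The PCB fields, the representation of a control program as a list with one entry per clock, and the function names are assumptions made for the example and do not reflect the actual PCB contents (Appendix C).

```python
from dataclasses import dataclass


@dataclass
class PCB:                       # pRing Control Block (fields are hypothetical)
    ring_id: int
    program: list                # assumed: one entry per clock of work
    pc: int = 0


def pring_step(pcb, log):
    """Single shared PRING routine: advance one pRing by one clock."""
    if pcb.pc < len(pcb.program):
        log.append((pcb.ring_id, pcb.program[pcb.pc]))
        pcb.pc += 1


def monitor(pcbs, clocks):
    """Present every PCB to PRING once per clock period."""
    log = []
    for _ in range(clocks):
        for pcb in pcbs:         # equal time slices for every pRing
            pring_step(pcb, log)
    return log


log = monitor([PCB(0, ["a0", "a1"]), PCB(1, ["b0", "b1", "b2"])], clocks=3)
```

The interleaved log shows two independent instruction streams advancing in lock step on a sequential host, with the shorter program simply idling once it has finished.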

The SimM is composed of several functional units: Global Control, Global
Communication Control, Program Loader, PRING, HOST and Monitor. Global Control sets
up the simulation environment. Global Communication Control handles inter-ring
communications. Program Loader loads the pRing control programs for every pRing.
PRING executes pRing control programs. Monitor controls the simulation environment and
acts as a bridge between the MNR and the outside world.

Global Control constructs the MNR system by setting up a PCB for every pRing. The
contents of the PCBs are associated with the architectural parameters provided by the user
(Appendix C). The PCBs keep the pRing processing status information during simulation.
Program Loader loads the pRing control programs for every pRing. Program Loader then
puts the starting address and length of the pRing control program in the PCB associated with
that pRing. Fig. 26 shows the Global Control and the Program Loader of SimM.

The pRings are the core of the MNR architecture. As discussed previously, the massive
parallelism proposed in this architecture comes from the parallelism among PEs and among
pRings. SimM treats each pRing as if it were an independent CPU executing its own control
program. All the synchronization problems encountered are resolved by the techniques used
in solving communication and networking problems.

A pRing is composed of PEs, an I/F, a CM and a CU. The entire pRing operation
is controlled by the CU, which actually comprises three major components: MCU (master
control unit), PCU (PE control unit) and ICU (interface control unit). Each of them serves
a unique function within the pRing. The MCU fetches instructions and passes them to the
PCU and the ICU. The PCU handles neuron arithmetic and the ICU handles
communications. The PCU and the ICU are viewed as special-purpose co-processors attached
to the MCU, which is the central processor. All processors execute their own classes of
instructions simultaneously. The MNR machine language contains special instructions to
synchronize these processors. In simulating these activities, SimM is able to handle both
the parallelism among PEs, and among and within pRings, as well as the serialism within
the MCU, the PCU and the ICU.

PRING executes the pRing control program as a real pRing would do. A pRing
control program can contain three classes of instructions: local instructions, PE instructions
and interface instructions. MCU executes local instructions, which are mostly housekeeping
or conditional program control transfers. PCU executes PE instructions, which are mostly
vectorized arithmetic instructions. ICU executes interface instructions, which involve
inter-pRing communication. Fig. 8 shows the pRing from a programmer's point of view.

In SimM, there typically exists only a single copy of PRING. SimM repeatedly reuses
the only copy of PRING with different PCBs to simulate more than one pRing executing
simultaneously. One can recognize this technique as fixed memory multiprogramming
management: PRING is a piece of fixed memory, and there are many virtual pRings to be
allocated to it. The only difference from memory management is that the virtual
pRing allocation is sequential in SimM. The same technique applies to the PEs: SimM has
only a single copy of the PE. By reusing the same piece of code, SimM creates the
illusion that many more PEs exist in the system.

This way SimM can simulate various numbers of pRings and PEs without
modification. The workloads of PCU, MCU and ICU are different from each other, so it may
be necessary to use different clock rates for each of them. SimM can execute different
classes of instructions at different clock frequencies. The system clock frequency is defined
by the architectural parameter CLOCK. The device clock can be defined as a multiple of the
system clock: PE_SPEED, CU_SPEED and IF_SPEED define the clock speeds for PCU, MCU
and ICU respectively. The number of pRings is defined by the architectural parameter
NO_PRING.

Every pRing in SimM has a PCB associated with it. Each PCB contains pRing-related
architectural parameters and the current pRing operating status. PRING is invoked by
Monitor once per clock cycle. Once PRING is invoked, every pRing in the system executes
a single clock cycle. The pRing invokes its components, namely MCU, PCU and ICU, if
the current clock cycle count is a multiple of the component's design speed. Fig. 27 shows
the PRING module of SimM.
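
The per-component clocking rule can be illustrated as follows. This is a sketch under assumptions: the divider values are invented for the example, and the dictionary keys simply name the three control units rather than the actual PE_SPEED, CU_SPEED and IF_SPEED parameter encoding.

```python
def units_to_invoke(cycle: int, speeds: dict[str, int]) -> list[str]:
    """Return the control units whose divider lines up with this system clock cycle."""
    return [unit for unit, divider in speeds.items() if cycle % divider == 0]


# Hypothetical settings: MCU on every cycle, PCU on every 2nd, ICU on every 4th.
speeds = {"MCU": 1, "PCU": 2, "ICU": 4}
schedule = {cycle: units_to_invoke(cycle, speeds) for cycle in range(1, 5)}
for cycle, units in schedule.items():
    print(cycle, units)
```

With these example dividers, all three units run together only on every fourth system clock cycle.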

SimM assumes HOST is another pRing without PEs. A HOST control program
contains only interface instructions and local instructions. Fig. 28 shows the HOST module
of the simulation program.

The master control unit, MCU, is responsible for fetching instructions from the control
memory and either executing them or dispatching them to other control units. The MCU also
contains an ALU (Arithmetic-Logic Unit) and a small set of registers, which are used for
housekeeping in execution of the pRing control program. Fig. 7 shows the block diagram
of the MCU.

A. Instruction fetcher and dispatcher.

Instruction fetching and dispatching time is intended to be fully overlapped with
execution time, even if it turns out that the overlapping is not necessary according to the
simulation results. MCU is typically the only unit in a pRing that has the privilege of
accessing control memory. MCU fetches an instruction, then decides to which controller it
should dispatch the instruction by examining the two most significant bits of the
instruction. ICU and PCU each have a 4-word instruction buffer. The length can
be changed by recompiling SimM. MCU stops fetching the next instruction under two
circumstances: the ICU (or PCU) instruction buffer is full and the next instruction is again
an ICU (or PCU) instruction, or the program counter reaches the end of the control program.
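
The fetch-and-stall rule can be sketched as below. The 16-bit word width and the mapping from the two most significant bits to a control unit are assumptions made for illustration; the text states only that those two bits select the destination and that each coprocessor buffer holds four words.

```python
from collections import deque

BUFFER_WORDS = 4  # ICU and PCU instruction buffer length given in the text


def dispatch_target(word: int, width: int = 16) -> str:
    """Pick a control unit from the two most significant bits (encoding is assumed)."""
    top2 = (word >> (width - 2)) & 0b11
    return {0b00: "MCU", 0b01: "MCU", 0b10: "PCU", 0b11: "ICU"}[top2]


def fetch_cycle(program, pc, buffers):
    """Try to fetch one instruction and return the new program counter.

    Fetching stalls at the end of the program, or when the next word is
    destined for a coprocessor whose buffer is already full.
    """
    if pc >= len(program):
        return pc                      # end of the control program
    target = dispatch_target(program[pc])
    if target != "MCU":
        if len(buffers[target]) >= BUFFER_WORDS:
            return pc                  # stall until the coprocessor drains its buffer
        buffers[target].append(program[pc])
    return pc + 1


buffers = {"PCU": deque(), "ICU": deque()}
program = [0x8001, 0x8002, 0x8003, 0x8004, 0x8005, 0xC001]
pc = 0
for _ in range(10):                    # nothing drains the buffers in this sketch
    pc = fetch_cycle(program, pc, buffers)
print(pc, len(buffers["PCU"]), len(buffers["ICU"]))
```

In this run the fifth PCU instruction is never fetched, because the PCU buffer is full and no simulated PCU is consuming it.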

B. Housekeeping. 

MCU keeps track of the status of ICU and PCU by examining the content of the status
register. Both ICU and PCU update the content of the status register according to their
ongoing status, i.e., busy or not. MCU refers to the status register when it performs
conditional program control transfers. Programmed synchronization within a pRing (i.e.,
among MCU, PCU and ICU) can be accomplished by busy-waiting for the units to be
synchronized. The instructions below resynchronize MCU, ICU and PCU.

    JPBSY $    ;wait for PEs
    JRBSY $    ;wait for ICU receiving
    JSBSY $    ;wait for ICU sending

The status register contains normal ALU flags such as Zero, Sign and Carry, and control unit
status flags such as PBSY (PEs are busy), ISBY (interface unit is busy sending) and IRBY
(interface unit is busy receiving). Conditional control transfer instructions refer to the status
register to achieve partial event-driven control.
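
As a rough illustration of busy-waiting on a status flag, the sketch below counts how many cycles a jump-to-self instruction would spin. The bit positions assigned to the flags are assumptions; only the flag names come from the text.

```python
# Flag names follow the text; the bit positions are assumed for illustration.
PBSY, ISBY, IRBY = 0x1, 0x2, 0x4


def busy_wait(status_trace, flag):
    """Count the cycles a jump-to-self on `flag` would spin before falling through."""
    spins = 0
    for status in status_trace:
        if not status & flag:
            break
        spins += 1
    return spins


# Status register contents on successive cycles (hypothetical trace).
trace = [PBSY, PBSY | ISBY, ISBY, 0]
print(busy_wait(trace, PBSY), busy_wait(trace, ISBY))
```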

Keeping all statistical information for the current pRing is part of MCU's job, though the
real controller of PCBs in SimM is Monitor. Monitor collects information in the PCB to
generate simulation statistics.

After fetching a PE instruction from the instruction buffer, PCU decodes the PE
instruction and broadcasts PE microcodes to every PE in the pRing. In other words, PCU
serves as a representative of the PEs to manage control signals from MCU to PEs, or vice versa.
MCU views all PEs together as a vectorized arithmetic processor, as do users of SimM.
The number of PEs is defined by the architectural parameter NO_PE_PRING.

C. Processing element (PE)

Neuron activities occur within PEs. For each PE instruction issued by MCU, PCU
transforms the instruction to process a data packet. A data packet contains a portion of the
neuron data vector. The length of a data packet is usually equal to the number of PEs in
the pRing, unless there is a fragmentation problem. As stated previously, a PE performs
primitive ALU functions. The major functions of a PE are addition and multiplication, which
are sufficient to perform neural computing. To keep the PE as simple and
flexible as possible, a bit-serial ALU was selected for it, although other components are
available. As a result, MNR system performance changes as data precision changes,
since a PE needs more time to process a data vector of higher precision. Changing the
data precision therefore affects not only data accuracy but also processing speed. Thus, the
MNR has a trade-off of precision for speed. For SimM, it would be a waste of time to
simulate every step of the bit-serial process. A decision was made to keep both the bit-serial
ALU feature and simulation speed: SimM keeps the timing needed for the bit-serial ALU
in the PE, but does the actual arithmetic in parallel. This mechanism speeds up SimM's operation.
The precision of data is limited to byte boundaries in SimM, although the timing of various
lengths of precision is kept. The precision is set by the architectural parameter
ARITHMETIC_LEN.
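
The idea of charging bit-serial time while computing with ordinary arithmetic can be sketched as follows. The cost model of precision squared times a 200 ns bit operation is an assumption chosen to match the figures quoted later for Test#1; the function name and signature are hypothetical.

```python
def mac(acc: int, weight: int, value: int, precision_bits: int,
        t_bit_ns: float = 200.0):
    """Return (new accumulator, simulated time in ns) for one multiply-accumulate."""
    result = acc + weight * value                 # computed in one host operation
    elapsed_ns = precision_bits ** 2 * t_bit_ns   # time charged as if bit-serial
    return result, elapsed_ns


print(mac(1, 2, 3, precision_bits=8))
print(mac(1, 2, 3, precision_bits=16))
```

The numerical result is identical for both precisions here; only the simulated time differs, which is the separation of timing from arithmetic that the paragraph describes.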
D. Weight memory 

Weight memory holds a portion of the weight matrix of the simulated neural network.
The size of weight memory is based upon the size of the weight matrix, the number of PEs
in the system, and the number of neurons in the system. In any case the size of weight
memory should be decided during pRing control program development. In a fully connected
network, the size of the weight memory grows with the square of the number of neurons.
Soon enough, weight memory allocation will use up a substantial part of available main
memory, so the size of the weight memory sets the limit for SimM. The architectural
parameter WMEM_SIZE defines the weight memory size.
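
A back-of-the-envelope sizing helper is sketched below. It assumes that each PE stores the complete weight rows of the neurons it serves and that neurons are spread as evenly as possible over the PEs; neither detail is stated in the text.

```python
import math


def weights_per_pe(n_neurons: int, n_pes: int) -> int:
    """Weight words one PE must hold for a fully connected network (assumed layout)."""
    neurons_served = math.ceil(n_neurons / n_pes)
    return neurons_served * n_neurons


print(weights_per_pe(30, 30))     # one neuron per PE
print(weights_per_pe(30, 1))      # a single PE holds the whole matrix
print(weights_per_pe(1000, 10))
```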
E. Accumulator-memory

Accumulator-memory stores the neurons' previous activation value and current partial
sum. Accumulator-memory also serves as a scratchpad for each PE. If we view the PE as
an ALU, accumulator-memory is the register file of this ALU. The AMEM_SIZE
architectural parameter defines the size of the accumulator-memory.
F. Processing buffer 

Each PE has three buffers: the transmitting buffer (T), the receiving buffer (R) and the
processing buffer (P). Generally speaking, T and R are both controlled by the ICU instead
of the PE. Every PE operation involves a P buffer, whether multiply-accumulate (XMAC),
data movement (PUT and GET) or arithmetic (ADDA, SUBA, etc.). P buffers are
circular buffers, so the content of the P buffer can be shifted to the next PE while the PE is
processing the current content of the P buffer. Since a bit-serial ALU is used for the PE,
the processing of a multiplication or addition is much slower than a buffer shift. That is, the
time needed for buffer shifting can be fully overlapped with the time needed for
multiply-accumulate.

Dynamic memory allocation is used in SimM to implement memories and buffers. SimM
implements the P buffer as a linked list. Accumulator-memory and weight memory are
implemented as linked arrays in SimM.
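
The circulating P buffer can be illustrated with a ring matrix-vector product. This is a functional sketch only, assuming one neuron per PE and a particular shift direction; it uses Python lists in place of the linked structures mentioned above and does not model the overlap of shifting with arithmetic.

```python
def ring_matvec(weights, activations):
    """Compute y = W x on a ring of K = len(x) PEs, one neuron per PE."""
    k = len(activations)
    p_buffers = list(activations)   # PE i starts with x[i] in its P buffer
    acc = [0] * k                   # accumulator-memory partial sums
    for step in range(k):
        for pe in range(k):
            src = (pe + step) % k   # index of the value now held by this PE
            acc[pe] += weights[pe][src] * p_buffers[pe]
        # circular shift: each PE hands its P buffer content to its neighbour
        p_buffers = p_buffers[1:] + p_buffers[:1]
    return acc


W = [[1, 2], [3, 4]]
x = [5, 6]
print(ring_matvec(W, x))
```

After K shift steps every PE has seen every activation exactly once, so each accumulator holds one complete row product.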

The interface control unit (ICU) is the I/O manager of the pRing. ICU controls the
routing from and to the outside of the pRing. Sending and receiving of data vectors always
take place at the same time in each pRing. To resolve the sending and receiving bottleneck,
ICU has separate interface units for sending and receiving data vectors, and the PE has
separate buffers to store incoming and outgoing data vectors. Both interface units access
data vectors through the pRing communication ports.

ICU instructs the interface unit to send out whatever is in the T buffer whenever ICU
encounters a SEND instruction in the pRing control program. Before the transmission begins,
the communication link between pRings must be established. In SimM, ICU sends a sending
request to the Global Communication Control. The request will be granted only if both parties
involved are ready to proceed.

The process for receiving a data vector is essentially the same. ICU sends a receiving
request to the Global Communication Control. This request will be granted only if both ends
of the communication link are ready. After the communication link is established, the data
vector flows from the T buffer of the sending pRing to the R buffer of the receiving pRing.
Communication bandwidth is adjustable through architectural parameters. Hardware
communication connections are determined by topology parameters. Communication links
can be established between pRings only if there exists a hardware communication connection
between the pRings.

The Global Communication Control resolves communication conflicts between pRings.
In the construction phase, Global Communication Control builds the communication channels
according to topology parameters and architectural parameters. In the simulation phase,
Global Communication Control handles inter-pRing communication requests, either via a
private channel or via the bus. Both communication conflicts and bus broadcasts are resolved
by Global Communication Control. Fig. 29 shows the Global Communication Control of SimM.

A pRing involves both single-destination and multiple-destination communication. There
exists little difference from the pRing's point of view, since the ICU isolates the pRing from
the outside world. But the Global Communication Control does distinguish
multiple-destination communication requests from single-destination communication requests.
For simplicity, bus communication is used for multiple-destination communication and point
communication is used for single-destination communication.

Bus communication involves a sending pRing and several receiving pRings. SimM uses
a sign-up procedure to implement bus communication. Identities of pRings which want to
receive data from a specific bus will be wait-listed by the Global Communication Control.
The receiving pRings wait until a sending pRing wants to send data through that bus. The
Global Communication Control informs those waiting pRings that they will receive data from
the bus, while it tells the sending pRing to send out its data.
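
The sign-up procedure can be modelled in a few lines. The class and method names below are hypothetical, and the sketch treats an empty result as "request not granted yet"; the actual SimM data structures are not described in the text.

```python
from collections import defaultdict


class BusSignUp:
    """Toy model of the wait-list kept for each bus (illustrative only)."""

    def __init__(self):
        self.wait_list = defaultdict(list)   # bus id -> pRings waiting to receive

    def request_receive(self, bus, ring):
        """A receiving pRing signs up and then waits for a sender."""
        self.wait_list[bus].append(ring)

    def request_send(self, bus, sender, vector):
        """Deliver to every signed-up receiver; an empty dict means the sender must wait."""
        receivers = self.wait_list.pop(bus, [])
        return {ring: list(vector) for ring in receivers if ring != sender}


gcc = BusSignUp()
gcc.request_receive(bus=0, ring=1)
gcc.request_receive(bus=0, ring=2)
delivered = gcc.request_send(bus=0, sender=0, vector=[0.5, -0.5])
retry = gcc.request_send(bus=0, sender=0, vector=[0.5, -0.5])
print(delivered, retry)
```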

Global Communication Control typically allows only one sending pRing per bus or point
communication link established. For point communication, there should also be only one
receiving pRing on the same communication link.

Global Communication Control monitors the current status of every communication
connection. ICUs also report their current status to Global Communication Control, so
Global Communication Control has an overview of the MNR system communication
topology and status.

To guard against programmer mistakes, Global Communication Control rejects
communication requests involving illegal communication ports, nonexistent hardware
connections or nonexistent buses. Global Communication Control also rejects inconsistent
topology definitions in the topology parameters, i.e., a hardware communication connection
should be defined at both ends of the communication link.
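
One way to picture the consistency check is shown below. It assumes that the topology is given as a set of directed (from, to) pairs in which each end of a link declares the connection; the real format of the topology parameters is defined in the appendix and is not reproduced here.

```python
def inconsistent_links(connections):
    """Return connections declared at only one end of the link.

    `connections` is a set of (from_ring, to_ring) pairs (assumed format).
    """
    return sorted((a, b) for (a, b) in connections if (b, a) not in connections)


topology = {(0, 1), (1, 0), (1, 2)}
print(inconsistent_links(topology))
```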

The major part of SimM is Monitor. Monitor consists of Monitor Interface, Error
Handler and Monitor Control. Monitor Interface is the man-machine interface of the
simulation program (see Appendix C for details of the man-machine interface of SimM). Error
Handler generates an alarm whenever an error occurs. The major function of Monitor
Control is to manage PCBs. Monitor provides performance-related data and component
utilization data from the contents of PCBs. Fig. 30 shows the component parts of Monitor.

The MNR architecture validation and evaluation are the major objectives of SimM. To
confirm an MNR architecture, the architecture is described in architectural parameters and
topology parameters as stated above. SimM reports errors discovered during the construction
phase. SimM subsequently reports inconsistencies between pRing control programs and the MNR
architecture.

Communication connections are one of the major concerns in designing an MNR system.
SimM is flexible enough to test all possible communication connections. SimM also has the
option to determine communication channel bandwidth and clock timing of interface units.

Communication conflicts are very difficult to detect and resolve on paper. The complexity
of the problems grows exponentially with the number of pRings. SimM can be used to
resolve communication conflicts, whether the conflicts are caused by the topology or by the
pRing control program.

The best way to debug a program is to run it. SimM simulates the MNR architecture down to
the microcode level, so SimM executes pRing programs that are compiled to MNR machine
code. SimM also provides a set of on-line commands to assist the debugging process.

SimM can be used as a utilization analysis tool to investigate both the
architecture-dependent utilization and the ANN model-dependent utilization. SimM provides
at any time the utilizations of MCUs, PCUs and ICUs. By changing the architecture or
execution speed of each device, the location of the execution bottleneck can be determined.
This delivers the information needed to design the most frequently utilized type of MNR
hardware at minimum cost.

As stated previously, to change a simulated ANN model on SimM requires merely a 
change in pRing control programs. This way, different ANN models can be simulated by 
SimM on the same or even different architectures of MNR. 

Architectural trade-offs of modules of MNR should be evaluated during the design phase of
an MNR system. Those trade-offs, like fewer large pRings versus more small pRings, relative
speeds of MCU, ICU and PCU, more point communication or a faster bus communication,
serial ALU versus parallel ALU, etc., can be important factors in designing an MNR system.
Those trade-offs can also be application-dependent. This is where an intensive simulation
program is helpful. SimM is flexible enough to test possible combinations of architectural
factors. SimM evaluates performance as part of architecture confirmation.

MNR architecture does not discriminate among ANN models. A different ANN model is
merely a different set of pRing control programs for MNR. SimM can execute generally any
pRing program assigned to it. To evaluate model-dependent performance, a model's pRing
control programs need only be executed. The simulation status of SimM gives the
performance of each individual model. From the information given by SimM, model-dependent
performance can be evaluated.

The performance and the properties of the MNR architecture, using SimM as a tool, are
investigated, evaluated and discussed below. In the discussion, the term 'interconnections/second'
refers to the number of multiply-accumulate operations that are performed in a second.
DARPA has, in the past, used speed (measured in interconnections per second) and capacity
(measured in interconnects) as reference variables. The same metrics are used in the
following discussion. The capacity of the MNR architecture is limited only by the capacity
of its weight memories, so the capacity of the MNR architecture is potentially infinite.
Therefore, attention is focused on the performance of the MNR architecture in terms of speed
and utilization. The inter-ring communication bandwidth, the number of PEs, the distribution
of the PEs (i.e., the number of pRings on the bus), and the arithmetic precision (i.e., the
number of bits needed to represent neuron values and weights) are the major factors that
affect the speed and utilization in the MNR architecture. The most advantageous trade-offs
between these factors for the MNR architecture are illustrated below. The ANN model used
in the tests is mainly the Hopfield model, though other models are also discussed.

Throughout the discussion, N represents the number of neurons in the system; K represents
the number of PEs in the system; k represents the number of PEs in each pRing; M represents
the number of pRings in the system; T stands for time, with subscript letters representing the
specific operation (e.g., $T_{mac}$ represents the time needed for a bitwise multiply-accumulate
operation); P stands for the precision of arithmetic operations; I stands for interconnections; and
V represents system speed. Speed and capacity are two major factors that DARPA uses in
evaluating ANN implementation tools. In DARPA's original speed/capacity plane (see Fig.
9), the analog fully parallel architecture resides in the upper left corner of the plane, which
indicates that it operates at very high speed, but with only limited capacity. The serial central
processor occupies the lower half of the plane, indicating that the architecture can be expanded
in capacity (by adding more memory) but with severe limits imposed on speed. The region
between both architectures is where the MNR architecture fits. The speed of the MNR
architecture can be expressed as:



$$ V = \frac{K}{P^{2}\, T_{mac}} \qquad (1) $$



Equation (1) shows that the speed of the MNR architecture is independent of the size of the
network, given that the neuron-PE ratio remains constant. The equation also tells us that the
system speed increases linearly with the number of PEs, and that the capacity
(interconnections) increases linearly with the number of PEs (i.e., each PE comes with its
own local memories). This indicates that both the speed and capacity of the MNR architecture
grow linearly with the number of PEs. The performance and properties of the MNR
architecture in simulating large scale models are derived from simulating and scaling smaller
models.

Test#1 is designed to validate the speed equation (1). In Test#1, conditions are as follows:
1. N/K = 1
2. M = 1
3. $T_{mac}$ = 200 ns

With P = 8 bits, the speed of the MNR architecture is

$$ V = \frac{K}{1.28 \times 10^{-5}} \qquad (2) $$

and if P = 16 bits, the speed of the MNR architecture is

$$ V = \frac{K}{5.12 \times 10^{-5}} \qquad (3) $$
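
The denominators in equations (2) and (3) can be checked numerically. They are consistent with a cost of precision squared times the 200 ns operation time listed in the Test#1 conditions; treating that product as the time per multiply-accumulate is an inference from the numbers, and the function name below is hypothetical.

```python
def predicted_speed(k_pes: int, precision_bits: int, t_op_s: float = 200e-9) -> float:
    """Interconnections per second predicted for K fully utilized PEs (assumed model)."""
    return k_pes / (precision_bits ** 2 * t_op_s)


denominator_8 = 8 ** 2 * 200e-9     # seconds per multiply-accumulate at 8 bits
denominator_16 = 16 ** 2 * 200e-9   # seconds per multiply-accumulate at 16 bits
print(denominator_8, denominator_16)
print(predicted_speed(30, 8))
```

Doubling the precision from 8 to 16 bits multiplies the denominator by four, which matches the ratio between the two published constants.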



Test#1 is set up with a single pRing to eliminate the possible PE utilization loss due to
inter-ring communication. Fig. 31 shows the simulation results and the predictions from
computations: the speed of the MNR architecture increased linearly with the number of PEs
in the system. The performance of the MNR architecture in terms of speed and capacity
resides exactly within the middle region as predicted. The deviation in speed of the MNR
architecture is caused by changing the degree of arithmetic precision. The speed of the MNR
architecture is degraded by a factor of $P^2$ for different precisions, because of the bit-serial
ALU design of the PEs. But the linearity of the speed/capacity still holds for the various
arithmetic precisions. The linearity in expansion provides the benefit of investigating the
properties of larger model implementations by running smaller models on the MNR
architecture. This can shorten the design and development time needed for large ANN model
implementations.

The observed linearity also provides almost infinite expansion power to the MNR
architecture in both speed and capacity, if the PEs are fully utilized. Unfortunately that is not
always the case. Several factors, such as communication bandwidth between pRings and
neuron-PE ratio, may reduce the PE utilization of the MNR architecture.

As stated previously, the performance of the MNR architecture in implementing large scale
ANN models can be derived from the performance in simulating smaller models. To further
investigate the properties of the MNR architecture, a 30-neuron fully connected Hopfield
model (900 interconnections) is tested.

The number of PEs (K) defines the processing power of the MNR architecture. The
overall system performance (speed and capacity) grows linearly with the number of PEs.
Test#2 is designed to verify the relationship between the MNR architecture performance and
the number of PEs in the system. The capacity grows linearly with the number of PEs, since
each PE carries its own weight memories. System speed (V) is considered in terms of the
number of PEs in the system. The setup conditions for Test#2 are:

1. P = 8 bits
2. M = 1
3. N = 30 (I = 900)
4. $T_{mac}$ = 200 ns

In Test#2, K is the only variable. The changes in speed are caused solely by K. By
changing K, the neuron-PE ratio (N/K) is also changed. System speed should decrease when
N/K increases, since each PE must serve more neurons in the system. Fig. 32 and Fig. 33
show the results of Test#2.

With reference to Fig. 32, the system speed of the MNR architecture grows linearly with
the number of PEs. This is expressed by a speed function as:

$$ V = C\,K \qquad (4) $$

Equation (4) agrees with equation (1) if $C = 1/(P^{2}\, T_{mac})$, where both P and $T_{mac}$ are
constants. The relationship between system speed and neuron-PE ratio is a linear degradation.
Fig. 33 gives the degradation of speed when increasing the neuron-PE ratio.
Now, the neuron-PE ratio is taken into consideration and the following equation is derived
from Fig. 33:

$$ V = C_{func}\,\frac{K}{N} \qquad (5) $$



Equation (5) reflects the degradation when increasing N/K. Comparing equation (5) with
equation (1), a great similarity can be found. The only difference is caused by the fixed
neuron-PE ratio of Test#1. Thus a more general expression for speed for the MNR architecture
can be derived from equations (1) and (5):

K (6) 






Equation (6) reflects the degradation of increasing the neuron-PE ratio; i.e., each PE must
serve more neurons. The degradations are also linear. Equation (6) can be explained as: the
system speed of the MNR architecture grows linearly with the number of PEs in the system,
and is degraded linearly by the size of the model, by the speed of basic multiply-accumulate
operations, and by the square of arithmetic precision.
In Figs. 32 and 33 the scales are logarithmic. These figures are redrawn and shown in 
Figs. 34 and 35 with linear scales on the abscissas. These figures provide more accurate 
readings. 

The results of Test#2 meet the previously made assumptions: the processing power of the
MNR architecture grows linearly with the number of PEs. The results have also shown, as
already predicted, that speed is reduced when the neuron-PE ratio is increased.

In the MNR architecture, the upper limit in the number of PEs is equal to the number of
neurons in the system. If this limit is exceeded, the system performance will not increase
when more PEs are introduced into the system. The MNR architecture operates at its
maximum speed when the neuron-PE ratio equals one. Thus, equation (6) is valid only for
$K \le N$.

Utilization is yet another important factor that affects the system speed. For the case
where the neuron-PE ratio is greater than one, the decrease in PE utilization will decrease the
system speed. This problem is referred to as the 'fragmentation problem'. Equation (6)
assumes that the PEs are fully utilized. Equation (7) covers the case where they are not fully
utilized.
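
The effect of a non-integer neuron-PE ratio on utilization can be illustrated with a simple model. It assumes that neurons are processed in passes of K at a time, so that the last pass may leave some PEs idle; this model is an illustration and is not taken from equation (7).

```python
import math


def pe_utilization(n_neurons: int, n_pes: int) -> float:
    """Fraction of PE time doing useful work when neurons are served in passes of K."""
    passes = math.ceil(n_neurons / n_pes)
    return n_neurons / (n_pes * passes)


for k in (30, 20, 15, 7):
    print(k, round(pe_utilization(30, k), 3))
```

For the 30-neuron model, 15 PEs are fully used (two full passes), whereas 20 PEs are only 75% used because the second pass occupies half of them.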

The distribution of the PEs over the pRings of the MNR architecture influences the number
of circulation cycles within the pRings and the length of the inter-pRing communication data
vector. Larger pRings will perform more efficiently if the communication channel cannot
keep up with the processing speed of the PEs. Smaller pRings, on the other hand, will
perform more efficiently if the bandwidth of the inter-pRing communication channel is large
enough to cope with the fast processing PEs. Larger pRings will reduce the inter-pRing
communication requirements, but would most likely suffer from the fragmentation problem.
Smaller pRings will bring more flexibility to the MNR architecture, but will suffer from a
slow communication channel. The various distributions of the PEs over the pRings of the
MNR architecture, and their contributions to the MNR system performance, are discussed
below. Test#3 is designed to investigate the effects of PE distribution in terms of speed and
utilization. The setup conditions for Test#3 are:

1. P = 8 bits
2. K = 30
3. N = 30 (I = 900)
4. $T_{mac}$ = 200 ns

The number of pRings (M) is one of the variables in Test#3. The other is the inter-pRing 
communication bandwidth (B). The PE utilizations of a multi-ring MNR architecture are 
affected heavily by the communication bandwidth, and the PE utilization of the MNR 
architecture directly relates to the system speed. Figs. 36-38 show the results of Test#3. 

Fig. 36 shows how the distribution of PEs affects the MNR performance under various 
communication bandwidths. The communication problem dominates the choice of the pRing 
size. In Fig. 36, the performance of the MNR architecture grows rapidly with the size of the 
ring if the inter-pRing communication is as slow as 10 Kbits/second. The growth is not so
obvious if the inter-pRing communication is as fast as 100 Mbits/second.

Figs. 37 and 38 show the changes in PCU utilization and ICU utilization when the number 
of PEs per pRing changes under various communication bandwidths. The ICU utilization 
decreases when the PCU utilization increases. A comparison of Fig. 36 and Fig. 37 shows that
the performance of the MNR architecture grows with the PCU utilization. That is understandable,
since the PEs offer the major processing power in the MNR architecture.

The distribution of PEs should be an additional consideration in choosing the
communication channels, and the communication bandwidth should be an additional
consideration in choosing between large pRings and small pRings. How will the
communication bandwidth affect the performance of the MNR architecture under different PE
distributions? To uncover the effects of communication bandwidth on the different
distributions of PEs, the results of Test#3 are redrawn as shown in Figs. 39-41.

Fig. 39 shows the changes of speed in terms of communication bandwidth under different
pRing sizes. The best performance of the MNR architecture occurs either with a single pRing
or when very fast communication channels exist.

The best PCU utilization and the worst ICU utilization, both of which reflect that most of
the system processing time is devoted to neural network processing and the least time to
communication overhead, occur either with a single pRing or with very fast communication
channels.

With reference to Figs. 39-41, the best performance attainable from a fixed number of PEs
is obtained by putting them into a single pRing (i.e., M=1), or by distributing them equally
over a number of pRings, given very fast inter-pRing communication channels.

The results of Test#3 suggest that larger pRings should be used. The larger the pRing, the
higher the utilization of the PEs and the faster the system operates. The best
way is to put all the PEs in the same pRing. Unfortunately, the acceleration of performance
stops at the point where the neuron-PE ratio equals one, because of the fragmentation
problem stated above. pRing fragmentation is dealt with just as memory fragmentation problems
are dealt with in digital computers. A pRing is an allocatable unit within the MNR
architecture. If k represents pRing size, i.e., k=K/M, then the average fragmentation loss
would be approximately k/2. Another consideration in favor of smaller pRings is the
advantage of modularity. Thus, if a fast communication channel exists between the pRings,
then several small rings would give the best of both worlds.
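
The k/2 fragmentation estimate can be checked with a small sketch. It assumes, as an illustration only, that a request for some number of PEs is satisfied by allocating whole pRings of size k and that request sizes are spread evenly; the function names are hypothetical and do not come from the MNR tooling.

```python
import math


def fragmentation_loss(pes_needed: int, k: int) -> int:
    """PEs left idle when whole pRings of size k are allocated."""
    rings = math.ceil(pes_needed / k)  # pRings are the allocatable unit
    return rings * k - pes_needed


def average_loss(k: int, max_request: int) -> float:
    """Mean idle-PE count over requests of 1..max_request PEs."""
    losses = [fragmentation_loss(n, k) for n in range(1, max_request + 1)]
    return sum(losses) / len(losses)
```

For k = 8 the mean loss over evenly spread request sizes is (k-1)/2 = 3.5 PEs, which is the "approximately k/2" figure quoted above.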

The bandwidth of the inter-pRing communication channel controls the exchange
rate of the data vector between pRings. If the communication bandwidth decreases, the
utilization of PEs and the performance of the MNR architecture will also seriously degrade.
Thus, the performance and the PCU utilization are both limited by the communication
bandwidth. As above, the MNR architecture introduces three-way overlapping operations in
the pRing, i.e., sending, receiving and processing. To evaluate the impact of the
communication bandwidth, an experiment is designed to test the various configurations. For
example, Test#4 is designed to explore the role of the communication bandwidth in the MNR
architecture. The setup parameters of Test#4 are:

1. P = 8 bits 

2. M = 5 

3. N = 30 (I = 900)

4. T^ = 200ns 

In Test#4, the number of PEs in the system, K, is varied. These K PEs are equally
distributed over M pRings, so each pRing contains k (= K/M) PEs. The communication
bandwidth (B) is expected to be an important factor in determining the utilization of each
device. If B is too small (slow communication channels), the PEs are almost always waiting
for communication. If B is too large (fast communication channels), the interface units
idle most of the time. Since the PEs' processing load remains constant in terms of
the number of multiply-accumulate operations, increased PE utilization will have the effect
of improving system speed.

The behavior of the MNR architecture under different communication bandwidths is
shown in Figs. 42 and 43. The system speed of the MNR architecture grows as the
communication bandwidth increases. Meanwhile, the PCU utilization grows rapidly with the
communication bandwidth towards 100%, and the ICU utilization drops. The PCU utilization
shoots up at a certain communication bandwidth, and the ICU utilization drops rapidly at
the same bandwidth. The system speed of the MNR architecture increases as the PCU
utilization increases (equation (7)), and ceases to increase when the PCU
utilization is limited. This result validates equation (7) and shows that whatever improves
PCU utilization will also improve the MNR performance. Increasing the communication
bandwidth does not add processing power to the MNR architecture, but it does reduce
the PCU idle time when the PEs are idle due to inter-pRing communication. When the PCU
is almost fully utilized, further improvement in the communication bandwidth has
little effect on the MNR performance. The differences between Fig. 42 and Fig. 43 are
caused by a different neuron-PE ratio. The ICU utilization in Fig. 43 (N/K=3) drops sooner
than that in Fig. 42 (N/K=1). This shows that the demands for communication decrease when
the neuron-PE ratio increases.

Inter-pRing communication time (T_comm) is defined in terms of communication bandwidth
(B), precision bits (P), neuron-PE ratio (N/K), and pRing size (k=K/M). Equation (8) shows
the time needed for a pRing to exchange a data vector with another pRing.

$$T_{comm} = \frac{\frac{N}{K} \times k \times P}{B} \qquad (8)$$

Since the pRing interface unit contains both sending and receiving devices, sending and
receiving data can take place at the same time. For M pRings, the communication time needed
for a network cycle (assuming a fully-connected network, so every neuron needs all the others'
values) should be M-1 times T_comm. And so equation (8) can be rewritten as:

$$T_{comm} = \frac{\frac{N}{K} \times k \times P \times (M-1)}{B} \qquad (9)$$

The utilization of ICUs regulates the contributions of interface units to system speed. T_comm
defined in equation (9) assumes the ICUs are fully utilized, which is definitely not the case.
T_comm defines the system processing time spent on communications. ICU utilization (U_ICU)
defines the percentage of communication time in overall system processing time. Therefore,
derivation of equation (10) from equation (9) is straightforward:

$$T_{MNR} = \frac{\frac{N}{K} \times k \times P \times (M-1)}{B \times U_{ICU}} \qquad (10)$$
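
The communication-time relations of equations (8) to (10) can be sketched as follows. This is a minimal illustration, assuming the equations have the form "bits exchanged divided by bandwidth", scaled by (M-1) for a network cycle and divided by the ICU utilization. The function and parameter names are hypothetical.

```python
def exchange_time(n, k_total, m, p_bits, bandwidth):
    """Eq. (8): seconds for one pRing to pass its data vector to another."""
    ring_size = k_total / m                      # k = K/M PEs per pRing
    values_per_ring = (n / k_total) * ring_size  # neuron values held by one pRing
    return values_per_ring * p_bits / bandwidth


def cycle_comm_time(n, k_total, m, p_bits, bandwidth):
    """Eq. (9): communication time per network cycle, ICUs fully utilized."""
    return (m - 1) * exchange_time(n, k_total, m, p_bits, bandwidth)


def comm_limited_time(n, k_total, m, p_bits, bandwidth, u_icu):
    """Eq. (10): system processing time implied by an ICU utilization u_icu."""
    if not 0.0 < u_icu <= 1.0:
        raise ValueError("u_icu must be in (0, 1]")
    return cycle_comm_time(n, k_total, m, p_bits, bandwidth) / u_icu
```

With the test settings N = 30, M = 5, P = 8 bits, and an assumed K = 30 and B = 1 Mbit/s, one exchange moves 6 values of 8 bits, so the network cycle needs four such exchanges.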

Test#5 is designed to investigate the communication bandwidth's influence on system speed,
and to validate equation (10). Although it is desirable to have an unlimited communication
bandwidth, most of the time a developer would have a fixed-bandwidth communication link.
Besides, the higher the communication bandwidth, the higher the cost and the lower the
utilization of the communication channel. Since the number of PEs defines the processing
power of the MNR architecture, it is important to find out how the speed relates to
communication bandwidth and neuron-PE ratio. Test#5 is set up as follows:

1. P = 8 bits

2. M = 5

3. N = 30 (I = 900)

4. = 200ns 

The number of PEs (K) and communication bandwidth (B) are the variables in Test#5. The
conditions of Test#5 are essentially the same as in Test#4. The PEs are distributed equally
over the M pRings. N/K is varied under various communication bandwidths in Test#5. Since
N, M, and P are constants in Test#5, according to equation (10), the system speed is mostly
affected by B and U_ICU.

With reference to Figs. 44 and 45, the need for communications decreases when the 
neuron-PE ratio increases. The communication bandwidth requirement is measured by ICU 
utilization. The ICU controls the pRing's interface unit. The ICU utilization then reflects the 
traffic of the communication channels. 

The increase in PCU utilization when communication bandwidth grows is significant. The 
relationship of the PCU utilization to performance of the MNR architecture is different for 
different neuron-PE ratios. 

Fig. 46 shows the variation in ICU utilization under various neuron-PE ratios. This shows 
that when the MNR architecture is simulating a larger ANN model (i.e., a large neuron-PE 
ratio), the communication bandwidth between pRings will not cause problems. The system 
processing speed does not increase linearly with PCU utilization ratio. This nonlinearity is 
caused by the communication problem. Equation (6) and equation (10) are combined to form 
a generalized MNR architecture processing time: 




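$$T_{MNR} = \max\left(T_{(6)},\; T_{(10)}\right) \qquad (11)$$

where T_(6) and T_(10) denote the processing times given by equations (6) and (10), respectively.

$$S_{MNR} = \frac{I}{T_{MNR}} \qquad (12)$$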



Equation (11) assumes that the operations of PCU and ICU are fully overlapped. Whichever
of the PCU or ICU needs the longer time to complete a network cycle dominates the system speed
in the MNR architecture. The PE processing speed dominates the system speed when the PCUs
are almost fully utilized, and vice versa. Equation (12) gives the generalized system speed
for the MNR architecture.
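
The "whichever is longer dominates" rule can be illustrated with a short sketch. Equation (6) is not reproduced in this section, so the processing-time model below (I interconnections, P² clock cycles per multiply-accumulate, spread over K PEs and divided by the PCU utilization) is an assumption made for illustration only. All names are hypothetical.

```python
def processing_time(i, p_bits, t_cycle, k_total, u_pcu):
    """Assumed PE-side time per network cycle (stand-in for equation (6))."""
    return i * p_bits ** 2 * t_cycle / (k_total * u_pcu)


def mnr_time(t_processing, t_communication):
    """Fully overlapped PCU and ICU: the longer of the two sets the cycle time."""
    return max(t_processing, t_communication)


def mnr_speed(i, t_mnr):
    """System speed in interconnections per second."""
    return i / t_mnr
```

Under these assumptions, with I = 900, P = 8, a 200 ns cycle, K = 30 and full PCU utilization, processing takes 384 microseconds. A communication time of 192 microseconds is then hidden behind it and the PE side dominates.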

The bit-serial arithmetic processing ALU residing in each PE makes it possible to trade
accuracy for speed and capacity. The arithmetic precision affects not only the arithmetic
processing time in PEs but also the data exchange rate between pRings. Thus, the arithmetic
precision influences the performance of the MNR architecture in two ways. First, it affects
the PE processing time, i.e., if the precision is P bits, then a multiply-accumulate needs
P² clock cycles to be completed. Second, it affects the communication time through the
communication channels. Test#6 is designed to find out what the impact of the arithmetic
precision is on the performance of the MNR architecture. The setup parameters of Test#6 are:

1. B = 1 Mbits/second 

2. M = 5 

3. N = 30 (I = 900)

4. T^ = 200ns 

Arithmetic precision (P) is varied to investigate the effects of P on system speed and
utilization. The PCU utilization is predicted to increase rapidly with the precision, since
increasing precision puts more processing load on the PEs. Because the precision affects the
processing load within PEs, the neuron-PE ratio (N/K) is also varied to find out the
properties of the MNR architecture in simulating larger models with different precision
requirements.

Fig. 47 shows the MNR architecture's ability to handle various precisions with the
bit-serial ALU with graceful degradation. The performance of the MNR architecture degraded
gently as the precision bits increased.

Fig. 48 shows the rapid increase of PCU utilization when the precision bits increase. The
increase in PCU utilization does not lead to system speedup, since the overall network size
(in terms of interconnections) does not increase.

The result of Test#6 shows the degradation in performance when arithmetic precision is
increased. The utilization of the PCU increases while the performance of the MNR
architecture decreases (Figs. 47 and 48). The ICU utilization decreases as predicted (Fig. 49).
The reason for this is that the PE processing time required increases with P² while
the communication time needed increases linearly with P. The effects the precision has on
communication are masked by the increased PE processing need. From the overall speed
point of view, decreasing the precision leads to speedup of the MNR system. This feature
provides the benefit of trading precision for performance.
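
The scaling argument can be written out as a short worked relation. This is a sketch that takes from the text only the statements that PE processing time grows with the square of P and communication time grows linearly with P. The proportionality constants a and b are placeholders, not values from the tests.

```latex
T_{proc}(P) = a\,P^{2}, \qquad T_{comm}(P) = b\,P
\quad\Longrightarrow\quad
\frac{T_{proc}(P)}{T_{comm}(P)} = \frac{a}{b}\,P
% Doubling the precision from P = 8 to P = 16 bits:
%   T_proc grows by a factor of (16/8)^2 = 4,
%   T_comm grows by a factor of 16/8 = 2,
% so the PCU is busy for a larger share of each cycle and the ICU for a smaller one.
```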

Thus far, an in-depth investigation of the properties of the MNR implementation
architecture has been presented. The flexibility and power of the SimM tool ease the testing
tasks. Several equations have been derived from architectural analysis and verified by
simulation. Values and trade-offs have been presented for evaluating the performance of the
MNR architecture. It has been determined that:

(see equations (11) and (12) above)

These equations generalize the MNR architecture's performance in terms of architecture 
parameters such as I, N, K, M, B, P, and of utilization figures. 

Simulation results under various conditions and variables were also derived and presented.
The results agree with the predictions given in equations (11) and (12). It was also shown that
the MNR architecture can be expanded in both speed and capacity by adding more PEs to the
system, thus demonstrating the MNR architecture's ability for modular expansion.

The MNR architecture is a modularly expandable system for large-scale ANN
implementation. Disregarding the PCU utilization loss due to the processing-control overhead,
system speed and capacity of the MNR architecture increase linearly with the
number of PEs. The MNR architecture cost-effectively satisfies the need for physical
system implementations of the theoretical large-scale neural network models.

The problem with evaluating the MNR architecture is that the large number of parameters
needed for evaluation rapidly increases the complexity of evaluating the architecture. Several
tests of performance were designed and presented, from which useful conclusions were drawn.
First, in Test#1, the MNR architecture's position in the DARPA speed/capacity plane is
presented (Fig. 9). The performance of the MNR architecture resides within the middle
region as predicted. Both the speed and the capacity of the MNR architecture increase linearly
with the number of PEs.

Test#2 shows that the MNR architecture's speed increases linearly with the number of PEs
and decreases linearly with the neuron-PE ratio. This feature gives the advantage of
investigating the properties of a large model's implementation by running a smaller model on
the MNR architecture.

Test#3 is designed to investigate the relationship between communication bandwidth and
pRing size. The test shows that the best performance of the MNR architecture occurs when
the PEs are nearly fully utilized. A low communication bandwidth in a multi-pRing system
reduces the PEs' performance. Smaller pRings result in a lower system speed if given a slow
communication channel. This suggests that a single large pRing may be preferred or a fast
communication channel must be available.

Test#4 demonstrates the relationship between the communication bandwidth and the
neuron-PE ratio. From the results of Test#4, the demands for communication decrease when the
neuron-PE ratio increases. Thus, the communication bandwidth will not become a problem
when the MNR architecture is used to simulate large ANN models (i.e., a large neuron-PE ratio).

The effect of the arithmetic precision was investigated with Test#6. Increasing the
precision puts more processing load on the PEs (by P²). Thus, PE utilization increases when
the precision increases, but the increase in PE utilization makes no contribution to system
speed. This result shows that speed can be traded for accuracy.

In general, trade-offs in the MNR architecture may be used to increase PCU utilization.
An increase in PCU utilization will also increase the MNR system speed. These trade-offs have
been discussed above for various conditions. It is suggested, by simulation, that certain
parameters should be considered when designing an MNR implementation system, namely:

1. Neural-network size, i.e., the number of neurons (N) and the number of interconnections (I),
because they define the processing load for the architecture.

2. PE processing speed (T), because PEs are the major processing power of the
architecture.

3. Arithmetic precision (P), because the primitive computation time of PEs (i.e., the time
needed for a bitwise multiply-accumulate operation) increases by the square of P.

4. Number of PEs (K), because the speed and capacity increase linearly with the
number of PEs, if they are fully utilized.

5. Inter-pRing communication bandwidth (B), because it limits the PE processing speed
if the communication channels are not fast enough. The utilization of PEs will decrease when
the PEs need to wait for communication.

These parameters are related to system speed as expressed in equations (11) and (12).
The number of PEs determines the system's potential for speed. The utilization of the PCUs
determines the percentage of this potential which is actually put to use. From this, it can
be determined whether the system is operating at its full potential for speed. The utilization
of the ICUs determines the expandability of the system configuration, since low utilization
of the ICUs indicates that the communication channels are under-used by the pRings. So,
adding more pRings will speed up the system, not only because of the increased number of
PEs, but also because of the even more timely delivery of data vectors. If the communication
bandwidth is large enough, the increased number of pRings will not slow down the delivery
of data vectors. Hence, the utilization of ICUs defines the communication bandwidth needed,
which, in turn, determines the expandability of the system configuration.

From a designer's point of view, both high PCU utilization and high ICU utilization are
desired. A high PCU utilization indicates that the system is operating at nearly its maximum
speed. A high ICU utilization shows that the designer did not pay too much for
under-used communication channels. But as pointed out, a low ICU utilization indicates the
expandability of the system. Thus, it is advisable always to seek high PCU utilization but
not to insist on high ICU utilization.

Fig. 50 demonstrates another speed and capacity performance of the MNR architecture as
estimated from simulations of Hopfield model and multi-layered ANN topologies for different
cases of parameterization of the MNR architecture. This too is superimposed on the
speed-capacity map adapted from DARPA (Fig. 9). It is clearly shown that the MNR architecture
occupies the difficult diagonal region of the map. Further analyzed, the MNR architecture
offers additional cost-performance trade-offs, as illustrated in Fig. 51. The areas plotted on
the map in this figure indicate the performance estimates for MNR implementations (of the
Hopfield network) at different levels of cost for memory, processors and other system
components. The areas marked reflect ranges of performance per $-cost as it is overrated by
cost efficiencies from 10:1 to 100:1.

Fig. 52 shows the performance of a two-pRing system processing a Hopfield ANN. The
number of PEs per pRing is varied and the PE utilization and processing speed (in effective
connections per second) are plotted. The processor utilization measures the percentage of time
that the PCU is active instead of idling while waiting either for an ICU synchronization point
or for the MCU to set up a new instruction block. From measurements taken in the lab,
most of the PCU idle time is attributable to the MCU DMA set-up overhead. Of course, PE
utilization increases with the size of the network, since the MCU overhead time is fixed and
the PCU vector instruction times are strong functions of the number of PEs. The processing
speed (in connections per second) increases linearly with the number of PEs which, to first
order, is expected. Plotted on the same figure is the system's burst speed assuming no
activation function processing or overhead.

Fig. 53 represents measurements taken when varying the number of pRings in the system.
For this test, fully populated pRings (40 PEs each) were used. In agreement with the previous
test and simulation results, both the PE utilization and processing speed increase with the
number of processors in the system. The system's burst speed (assuming no activation
function or communication overhead) is also plotted on the same figure. In obtaining the data
for these charts, the mechanisms leading to the overhead were studied, and utilization and
speed measurements taken. From this information, experimental data was extrapolated to a
fully configured prototype system (i.e., with the maximum of 16 pRings on the bus). To
achieve this level requires merely fabricating another ten pRings with no design
modifications. Even for the modest sized prototype, over 10' interconnections per second are
achieved.

A three-layered neural network with the error backpropagation (BP) learning rule was also
implemented. The measurements of the BP model were taken from temporal decomposition
with 16-bit precision for weights and activations. A BP model includes a forward phase and
a learning phase. Thus, a multi-layer feedforward model's measurement was taken from the
forward phase of the BP implementation. Fig. 54 represents measurements taken when
varying the number of pRings and the number of PEs per pRing in the system. For this test,
1-5 pRings were used, each with 8-40 PEs, so the measurements cover performance
figures ranging from 8 PEs to 200 PEs. Fig. 54 confirms the linear scalability of the MNR
architecture. Whether the PEs are in the same pRing or not, the performance
(interconnections/second) of the MNR architecture increased linearly with the number of PEs
in the system.

Fig. 55 represents the performance of the MNR prototype hardware implementing a
three-layered neural net with the BP learning rule. Every learning cycle consists of a
feedforward phase and a BP phase. The activation function of the BP model requires a smooth
function (e.g., a sigmoid function). The logistic sigmoid function was implemented using a
quadratic Taylor series expansion in ten intervals. Since the table of constants for this
approximation is stored in each PE, it is possible to have different activation functions for
PEs in the same pRing. The measurements in Fig. 55 also cover learning performance ranging
from 8 PEs to 200 PEs. The learning performance, again, increased linearly with the number of
PEs in the system.
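
A piecewise quadratic Taylor approximation of the logistic sigmoid over ten intervals can be sketched as follows. The text does not give the input range, the interval boundaries or the expansion points, so this sketch assumes ten equal intervals over [-5, 5], expansion about each interval midpoint, and saturation to 0 or 1 outside the range. The constant table plays the role of the per-PE table of constants. All names are hypothetical.

```python
import math

LO, HI, N_INTERVALS = -5.0, 5.0, 10   # assumed range and interval count
WIDTH = (HI - LO) / N_INTERVALS


def _coeffs(c):
    """Taylor constants of the logistic sigmoid about the point c."""
    s = 1.0 / (1.0 + math.exp(-c))
    d1 = s * (1.0 - s)                 # first derivative
    d2 = d1 * (1.0 - 2.0 * s)          # second derivative
    return (c, s, d1, 0.5 * d2)


# One row of constants per interval, expanded about the interval midpoint.
TABLE = [_coeffs(LO + (i + 0.5) * WIDTH) for i in range(N_INTERVALS)]


def sigmoid_approx(x):
    """Quadratic Taylor approximation, saturating outside [LO, HI)."""
    if x < LO:
        return 0.0
    if x >= HI:
        return 1.0
    i = min(int((x - LO) / WIDTH), N_INTERVALS - 1)
    c, f0, f1, f2 = TABLE[i]
    dx = x - c
    return f0 + dx * (f1 + dx * f2)    # Horner form: f0 + f1*dx + f2*dx^2
```

Because each PE would hold its own copy of such a table, a different set of constants per PE would give a different activation function, as noted above.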

The PEs in the MNR system are the source of the processing power. Thus the art of
efficient programming in the MNR system becomes a matter of keeping as many
PEs as busy as possible. Fig. 56 shows the PE utilization, which represents the efficiency of
MNR programming in accordance with the invention. The PEs in the system are kept at over
90% utilization as shown in Fig. 56. Moreover, the utilization increases slightly as the size
of the model increases. This is accounted for by the quadratic growth in processing
requirements and the linear growth in communication requirements as the model grows. Thus,
the MNR system will be kept near 100% utilization (in terms of PE utilization) when
implementing large models, for which the architecture is suited.

Different ANN models are expressed by different pRing control programs for the MNR
architecture. Performance evaluation of different ANN models in the MNR architecture is
accomplished essentially by determining the performance of the architecture when executing
the corresponding pRing control programs. Utilization of the dynamic devices determines the
major effect that the ANN models have on the performance of the MNR architecture.

Research efforts are needed to define the specific relationship between various ANN
models and the utilization of the corresponding devices in MNR implementations. In other
words, the representative instruction-mixes for different ANN models should be defined. The
utilization of the corresponding devices can then be determined from these
instruction-mixes. The performance of the MNR architecture in simulating various ANN models
can then be investigated in more depth.

The investigation of the MNR architecture is currently based on a single-bus pRing 
structure. Other possible communication topologies are available. Other dynamic 
communication network topologies, such as multi-bus structures, are possible alternatives to 
the single-bus one. 

The single-bussed MNR structure can be used when the design allows any of the M pRings
to connect to any other pRing. In this case, any pRing-pair can use the bus for
communication. The MNR architecture can also support multi-destination communications,
with which a pRing can send the same data vector to multiple pRing-destinations
simultaneously.

When M, the number of pRings, is large, extremely fast busses are required, and special
design and programming precautions must be taken to minimize the need for access to the
bus. By providing several busses in the architecture, these precautions may be eased. Each
pRing can connect to one or more of the available busses. The multi-bussed system not only
reduces the communication load per bus but also provides a degree of fault tolerance.
Multiple busses also provide even more flexibility to the MNR architecture. The bussed
pRing slab can then become an allocatable unit in an MNR workstation environment.

To obtain the most out of the MNR architecture, a high-level ANN specification language
and an associated optimizing compiler are useful. The language preferably performs efficient
assignment of PEs and communication scheduling. The details of the architecture are hidden
from the programmer as much as possible. The optimizing compiler generates the required
pRing machine codes based on the current configuration of the system, since the MNR
architecture is field-upgradable.

Occam has been proposed as a higher level language for ANN model implementation over
transputers. ANSpec has been developed by SAIC to model massively parallel distributed
systems and can also be used to specify and manipulate ANN models. NNDL (Neural
Network Design Language) has been developed for processing vectorized data sets in parallel
on neural networks. An MNR optimizing compiler is needed for the specification language
of choice. The dynamic PE and pRing assignments, and the communication scheduling
according to those assignments, affect the compiler.

The behavior of the digital computer is described by deterministic and precise languages.
The digital computer generally can accept only clear and complete information. In contrast,
ANNs can operate with fuzzy data to produce relatively reasonable results.

Fuzzy logic methods of data representation and processing can be applied to artificial
neural networks. Fuzzy logic can be used to deal with uncertain information processed by
ANNs.

Both neural networks and fuzzy logic apply experimentally verified rules rather than
algorithmic formulas to the reasoning problem; that is, they use inductive reasoning rather
than deductive reasoning. This property provides a more direct emulation of the human brain.
Therefore, combining neural network processing with fuzzy logic should supply a more
representative solution to this world of fuzziness.

The pRings within the MNR architecture are essentially vectorized processors. The
sum-of-product operations of neural networks are the basic matrix-computation operations.
With the MIMD nature of the MNR architecture, there exist applications other than ANN
implementations. The MNR architecture offers an excellent architecture for matrix operations.
Applications that involve regular operations, such as matrix operations, are suitable for
execution on the MNR architecture. The data assignments in this kind of application should be
handled very carefully. Studies should be made to find a general algorithm for the MNR
architecture to be used on non-ANN applications.

Although the present invention has been described with reference to a preferred
embodiment, the invention is not limited to the details thereof. Various modifications and
substitutions will occur to those of ordinary skill in the art, and all such modifications and
substitutions are intended to fall within the spirit and scope of the invention as defined in the
appended claims.

Appendix A Instruction Set and Machine Code Definition 



MCU ASM Instructions

                   type    len     OP CODE   m     r2     r1
ASM Instruction    15-14   13-12   11-7      6     5-3    2-0    comments
mov dd, rr         10      01      00000     0     rr     dd
mov dd, @rr        10      01      00000     1     rr     dd
mov dd, #cnst      10      10      00000     0     111    dd
mov dd, @cnst      10      10      00000     1     111    dd
mov dd, rr         10      01      00001     0     dd     rr
mov @dd, rr        10      01      00001     1     dd     rr
mov @cnst, rr      10      10      00001     1     111    rr
Illegal inst.      10      10      00001     0     xxx    111
add dd, rr         10      01      00010     0     rr     dd
add dd, cnst       10      10      00010     0     111    dd
add dd, @rr        10      01      00010     1     rr     dd
add dd, @cnst      10      10      00010     1     111    dd
sub dd, rr         10      01      00100     0     rr     dd
sub dd, cnst       10      10      00100     0     111    dd
sub dd, @rr        10      01      00100     1     rr     dd
sub dd, @cnst      10      10      00100     1     111    dd
cmp dd, rr         10      01      10100     0     rr     dd     don't save result
cmp dd, cnst       10      10      10100     0     111    dd
cmp dd, @rr        10      01      10100     1     rr     dd
cmp dd, @cnst      10      10      10100     1     111    dd
and dd, rr         10      01      00110     0     rr     dd
and dd, cnst       10      10      00110     0     111    dd
and dd, @rr        10      01      00110     1     rr     dd
and dd, @cnst      10      10      00110     1     111    dd
or dd, rr          10      01      00111     0     rr     dd
or dd, cnst        10      10      00111     0     111    dd
or dd, @rr         10      01      00111     1     rr     dd
or dd, @cnst       10      10      00111     1     111    dd
xor dd, rr         10      01      01000     0     rr     dd
xor dd, cnst       10      10      01000     0     111    dd
xor dd, @rr        10      01      01000     1     rr     dd
xor dd, @cnst      10      10      01000     1     111    dd
HALT               10      01      11111     0     000    000
SYNC               10      01      11110     0     000    000
jcc rr             10      01      01010     0     cc     rr*    *rr 0..6: jmp to addr (rr)
jcc xxxxx          10      10      01010     0     cc     111*   *rr 7: jmp to addr xxxxx
jncc rr            10      01      01010     1     cc     rr*
jncc xxxxx         10      10      01010     1     cc     111
Ccc rr             10      01      01110     0     cc     rr*
Ccc xxxxx          10      10      01110     0     cc     111*
Cncc rr            10      01      01110     1     cc     rr*
Cncc xxxxx         10      10      01110     1     cc     111
rcc                10      01      01100     0     cc     xxx
rncc               10      01      01101     0     cc     xxx

cc      code    ncc             code
Z       000     NZ              000
P       001     NP (M)          001
C       010     NC              010
RBSY    011     NRBSY (RRDY)    011
SBSY    100     NSBSY (SRDY)    100
PBSY    101     NPBSY (PRDY)    101
TRUE    111     NTRUE (FALSE)   111
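
As an illustration of how the MCU bit fields combine, the constant-operand rows of the table (type = 10, len = 10, r2 = 111) can be encoded with the sketch below. The function name is hypothetical, and placing the constant in a second 16-bit word is an inference from the 32-bit words in the Appendix B listing (for example A0380001 and A13A0005), not a statement from the table itself.

```python
# Opcode values copied from the OP CODE column (bits 11-7) of the MCU table.
OPCODES = {"mov": 0b00000, "add": 0b00010, "sub": 0b00100}


def encode_const_form(mnemonic, dd, const, indirect=False):
    """Encode '<mnemonic> dd, #cnst' (or '@cnst' when indirect) as two 16-bit words."""
    if not 0 <= dd <= 7:
        raise ValueError("dd must fit in the 3-bit r1 field")
    word = (
        (0b10 << 14)                # type field
        | (0b10 << 12)              # len field of the constant-operand rows
        | (OPCODES[mnemonic] << 7)  # OP CODE
        | (int(indirect) << 6)      # m: 1 selects the '@' form
        | (0b111 << 3)              # r2 = 111 marks a constant operand
        | dd                        # r1 = destination register
    )
    return [word, const & 0xFFFF]   # assumed: constant follows as a second word
```

Under these assumptions, a move of the constant 1 into register 0 yields the words A038 and 0001, and an add of 5 to register 2 yields A13A and 0005, matching words that appear in the Appendix B listing.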





PCU ASM Instructions

MAC

type    len     oph     s    rr     ff     dd
15-14   13-12   11-10   9    8-6    5-3    2-0    comments
01      01      00      1    rr     ff     dd     xmac @dd,@rr,ff
01      01      00      0    rr     ff     dd     mac @dd,@rr,ff
01      01      00      1    111    111    111    shift P
01      01      00      0    111    111    111    nop

P<->A/W

type    len     oph     OP    p    cc     a    D    rr
15-14   13-12   11-10   9-8   7    6-5    4    3    2-0    comments
01      01      11      00    0    00     0    0    rr     puta @rr
01      01      11      00    0    00     0    1    rr     geta @rr
01      01      11      00    0    00     1    0    rr     putw @rr
01      01      11      00    0    00     1    1    rr     getw @rr
01      01      11      00    0    cc     0    0    rr     putacc @rr
01      01      11      00    0    cc     0    1    rr     getacc @rr
01      01      11      00    0    cc     1    0    rr     putwcc @rr
01      01      11      00    0    cc     1    1    rr     getwcc @rr
01      01      11      00    1    cc     0    0    rr     putancc @rr
01      01      11      00    1    cc     0    1    rr     getancc @rr
01      01      11      00    1    cc     1    0    rr     putwncc @rr
01      01      11      00    1    cc     1    1    rr     getwncc @rr

D: 0 = P->A/W, 1 = P<-A/W
cc cond: 0 = no, 1 = Z, 2 = C, 3 = S



ALU

type    len     oph     OP         rr
15-14   13-12   11-10   9-3        2-0    comments
01      01      01      0000000    rr     adda @rr
01      01      01      0000010    rr     suba @rr
01      01      01      0000011    rr     cmpa @rr
01      01      01      0000100    rr     anda @rr
01      01      01      0000101    rr     ora @rr
01      01      01      0000110    rr     xora @rr




Buffer

type    len     oph     OP        bufr    bufd
15-14   13-12   11-10   9-4       3-2     1-0     comments
01      01      10      000000    br*     bd*     mov bd, br
01      01      10      000001    br*     bd*     xchg bd, br

* P = 0, R = 1, T = 2

ICU ASM Instructions

                   type    len     OP CODE     ports
ASM Instruction    15-14   13-12   11-4        3-0
send port          11      01      xxxx0000    port
rcv port           00      01      xxxx0001    port

x: don't care.
Appendix B Simulated ANN pRing Program Listing

Program for a 30-neuron Hopfield net. The listing shows the program for pRing 1 in a
3-pRing, 5-PE-per-pRing MNR system.



0000 
0000 
0000 » 
0002 = 
0005 = 
0030 = 
003£ = 
003F = 
0040 » 



bus: 
right: 
k: 

zero: 
thrsh: 
acti: 
act2: 



0000 A0300064 
0004 Ipl: 
0004 A038063O 
0008 5C18 
OOOA A0380001 
OOOE 5C00 
0010 A0380002 
0014 5C00 
0016 A03B0001 
001A A03A0001 
001EA0380003 
0022 5C08 
0024 A0390001 
0028 5201 
002AA13A0005 
002E A0390002 
0032 A52F=0032 
0036 52D1 
0038 A13A0005 
0O3CA52F003C 
0040 5802 
0042 A03800O4 
0046 5C08 
0048 A03C0003 
004C lp2: 
004C A0390a)1 
0050 5201 
0052 0002 
0054 1010 
0056A13A0005 
005AA0390002 
005EA52F005E 
0062 5201 
0064 A13A0005 
0068 A51F0068 
006C A527006C 
0070 A52F0070 
0074 5802 

0076 5804 

0078 A13CFFFF 
007CA547004C 
0080 AOSgOOOl 
0084 5201 
0086 0002 
0088 1010 
008AA13A0005 
008EA0390002 
0092 A52F0092 
0096 5201 
0098 A13A0005 
0OgCAO39O0O1 
OOA0A51F0OAO 
0OA4A52F00A4 
0OA8 5804 
00AA52O1 
OOACA13A0005 
OOBO A0390002 
00B4 A52f=00B4 
00B8 5201 
OOBAA0390001 
OOBEA03A0003 
0OC2A73F00EO 
00C6 A0390002 
0OCAA03A00O4 
00CEA73F00EO 

0002 9FO0 

0004 A130FFFF 

0OO8A54700O4 

0ODCA53FO0FC 



cpu 
hof 
equ 
equ 
equ 
equ 
equ 
equ 
equ 

mov 

mov 

gew 

mov 

puta 

mov 

puta 

mov 

mov 

mov 

geta 

mov 

xmac 

add 

mov 

jpbsy 

xmac 

add 

Jpbsy 

mov 

mov 



mov 

mov 

xmac 

send 

rev 

add 

mov 

jpbsy 

xmac 

add 

jpfasy 

mov 

mov 

add 

jnz 

mov 

xmac 

send 

rev 

add 

mov 

jpbsy 

xmac 

add 

mov 

irbsy 

jpbsy 

mov 

xmac 

add 

mov 

ft)bsy 

xmac 

mov 

mov 

caU 

mov 

mov 

caJJ 

SYNC 

add 

inz 



•pring.tbr 

0 

2 

5 

61 

62 

63 

64 

r5.#100 

rO,#zero 

@f0 

r&.#1 

fe,#2 

trO 
.#1 
r2,#1 
rO.#3 
@r0 
M[.#1 

gMi,@r2,r3 

n;#2 

|rj.@r2.r3:. 

$ 

r<J!#4 
@rO 
r4.«3 

n.#i 

@r1.@r2.r3 

nght 

bus 

r2,#k 

r1 ,#2 

$ 

@r1,@r2,r3;. 



:pe/pring 
;adaress of 
[Constants in 
.weight 
:memory 

;f or 1 0 network cycies 
:..initaccum (1.2) to 



;„set xmac offset =• 1 
;..wbase = 1 
:.-P = 3 

:..xmac 1.wbase,1 

;..wbase = wbase -t^ k: 
;..set up for next xmac 
:..wait PE 
.xmac 2, wbase. 1 
;..wbase » wbase -t- k 
:..wait PE 
:..t=p 
:..p»4 

:..for 3 iterations { 

; xmac 1, wbase. 1 

:....send right 
;....rcv bus 

;....wbase = wbase k 
.....set up for next xmac 
;....wait PE 
...xmac2,wbase,1 
;....wbase = wbase -t> k 
:....wait rev,snd,pe 



l = p 
.p»r 



•xmac 



1, wbase, 1 
right 



:..send 
;..rcv bus 
.wbase = wbase k 
:..set up next xmac 
;..wait pe 
jcmac 2, wbase 
;..wbase = wbase + k 
;..set up next xmac 
:..wait rcv.pe 

:.-P = r 
^.xmac 1. wbase, 1 

;..wt>ase = wbase ^ k 

:..set up next xmac 

;..wait pe 
;jcmac2.wba5e,1 

;..CompuiB activation (1 >> 3) and (2 -> 4); 



;..Compute activation (1 -> 3) and (2 -> 4): 



;ALLDONE 



; activation computation, binary output
; r1 = @summation, r2 = @activation
00E0            active:
00E0 A038003E   mov       ;fetch threshold from weight mem
00E4 5C18       getw
00E6 5411       suba      ;acc - threshold
00E8 A038003F   mov       ;get low value of activation
00EC 5C18       getw
00EE A0380040   mov       ;set up to get high value
00F2 5C78       gtwp      ;get high value if acc > thresh
00F4 5C02       puta      ;put away value
00F6 A52F00F6   jpbsy
00FA 9638       ret

00FC 9F80       STOP: HALT
0000            END
Appendix C SimM User's Manual
C.1 Introduction to SimM

SimM (for "Simulation of MNR") is intended to represent the MNR architecture in
software modules, to provide an experimental environment for confirming architectural
trade-offs, and to allow ANN-model-dependent performance analysis. SimM will also be used as
an MNR architecture development tool to assist in communication conflict resolution, to debug
pRing programs, to analyze hardware utilization and to experiment with various hardware
configurations. SimM can also serve as an ANN model development tool to confirm the
predicted activities of a proposed new ANN model.

To accomplish the objectives stated above, SimM must be able to restructure itself through
changes in parameters. These parameters include architectural parameters, topology
parameters and pRing control programs.

SimM also provides a set of on-line commands. One can operate SimM as if operating a
real MNR machine. This tool allows the developer to examine and change the values of
accumulator-memory and weight memory, to monitor the changing ANN states at various
times and to check current utilization of devices. SimM offers all the features a developer
needs, and more. With SimM, one can test a new ANN model on the MNR system as well
as test a new MNR design with an existing ANN model.
C.2 SimM command line parameters

At the DOS prompt (i.e., C:>) type SIMM without any command line arguments. SimM
will respond with:

USAGE:

SIMM <architecture> <topology> <program> <neuron map> <weight matrix>

SimM is telling you that some important parameters are necessary for processing. We
will discuss these parameters in this chapter. After finishing this chapter, you will be able to
create an MNR architecture with those parameters on your own. The architecture file contains
definitions of the MNR hardware architecture. The topology definition file has the definitions
of pRing connections. The program file defines the pRing control program. The neuron map
file tells SimM where neurons are and where to show them. The weight matrix file holds the
initial values of weight memory. With these parameters, SimM builds an MNR system
environment to be used.
C.2.1 Architecture file

The architecture file contains parameters needed to create pRings. The parameters are
entered as a keyword followed by an integer. Table 1 shows the currently available
parameters and their meanings.

A sample architecture file for a 3-pRing, 5-PE-per-pRing MNR system is presented in
Figure C.1. The MNR system has 3 pRings. Each pRing contains 5 PEs. Each PE has 61 bytes
of weight memory and 16 bytes of accumulator-memory. All PEs employ 16-bit arithmetic
operations. The relative system clock runs at 20 MHz (50 ns clock cycle). PEs of the MNR
system run at 20 MHz while the MCU runs at 4 MHz, and the interface (IF) runs at 2 MHz.
Every pRing in the system has 5 communication ports, each with a communication bandwidth
of 16 Mbits/second.



KEYWORD             MEANING
NO_PRING            number of pRings on system.
NO_PE_PRING         number of PEs within each pRing.
WMEM_SIZE           size of weight memory within each PE (bytes).
AMEM_SIZE           size of accumulator-memory within each PE (bytes).
COMM_PORT           number of communication ports for a pRing.
CHANNEL_BANDWIDTH   channel bandwidth of communication lines (bits/IF_SPEED).
ARITHMETIC_LEN      arithmetic precision (bits) of PEs.
CLOCK               relative system clock cycle time (ns).
CU_SPEED            pRing control unit (MCU) speed according to relative system clock.
PE_SPEED            PE speed according to relative system clock.
IF_SPEED            Interface unit (IF) speed according to relative system clock.



Table 1. Architectural parameters

NO_PRING 3
NO_PE_PRING 5
ARITHMETIC_LEN 16
CHANNEL_BANDWIDTH 8
COMM_PORT 5
CLOCK 50
PE_SPEED 1
CU_SPEED 5
IF_SPEED 10
WMEM_SIZE 61
AMEM_SIZE 16

Figure C.1 Sample SimM architecture file
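The clock rates and channel bandwidth quoted for the sample file follow from the CLOCK and *_SPEED values. The sketch below shows the arithmetic, assuming each unit's cycle lasts its speed factor times CLOCK nanoseconds and that CHANNEL_BANDWIDTH is in bits per IF cycle. The struct and function names are illustrative and are not SimM's own.

```c
/* Hypothetical container for a few of the Table 1 keywords. */
struct arch_params {
    int clock_ns;           /* CLOCK: system clock cycle time in ns   */
    int pe_speed;           /* PE_SPEED: PE cycle in system clocks    */
    int cu_speed;           /* CU_SPEED: MCU cycle in system clocks   */
    int if_speed;           /* IF_SPEED: IF cycle in system clocks    */
    int channel_bandwidth;  /* CHANNEL_BANDWIDTH: bits per IF cycle   */
};

/* Frequency in MHz of a unit whose cycle is `speed` system clocks. */
double unit_mhz(const struct arch_params *a, int speed)
{
    return 1000.0 / ((double)a->clock_ns * (double)speed);
}

/* Channel bandwidth in Mbits/second: bits per IF cycle times IF rate. */
double channel_mbits_per_s(const struct arch_params *a)
{
    return (double)a->channel_bandwidth * unit_mhz(a, a->if_speed);
}
```

With CLOCK 50, PE_SPEED 1, CU_SPEED 5, IF_SPEED 10 and CHANNEL_BANDWIDTH 8, this gives 20 MHz for the PEs, 4 MHz for the MCU, 2 MHz for the IF and 16 Mbits/second per channel, matching the figures in the text.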
C.2.2 Topology definition file 

The topology file starts with the keyword TOPOLOGY, followed by pRing connection
definitions. Every row of data begins with a pRing identity (ID), followed by the bus
connection, then the communication ports. pRing IDs range between 1 and the maximum number of
pRings specified in the architecture file. The row of data whose pRing ID is 0 defines the bus
identity of this MNR system. All bus connections should refer to the bus identities defined in
row 0. Communication ports should contain either a pRing identity or -1 (no connection). A
legal pRing identity in a pRing communication port means that a communication link is
established between them. SimM will connect the current pRing with the pRing whose
identity is shown in the communication port indicated. pRings communicate by way of
communication ports, so the topology definition should always remain consistent, i.e., a
connection should be specified by both parties of a communication link. If pRing 1 has a
connection to pRing 2 through one of pRing 1's communication ports, then pRing 2 should
also have a connection to pRing 1 via one of its communication ports.

SimM will automatically reject illegal pRing and bus designators. SimM also checks
topology consistency, and requests that the error, if there is one, be fixed before any further
processing. Figure C.2 is a sample topology definition file. In Figure C.2, we defined a
single-bus MNR with three pRings. The connections in Figure C.3 show the topology resulting
from the definition in Figure C.2.



TOPOLOGY
0 0 1 -1 -1 -1 -1
1 1 -1 2 -1 -1 -1
2 1 1 3 -1 -1 -1
3 1 2 -1 -1 -1 -1

Figure C.2 Sample topology definition file
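The consistency rule described above (every link must be declared by both pRings) can be sketched as follows. This is not SimM's implementation; the array layout, the `MAX_PORTS` bound and the function name are assumptions made for illustration.

```c
#define MAX_PORTS 8   /* illustrative upper bound, not a SimM constant */

/* ports[r][p] holds the pRing ID wired to port p of pRing r+1,
 * or -1 for no connection. Returns 1 when every link is declared
 * by both ends and every ID is legal, 0 otherwise. */
int topology_consistent(int n_prings, int n_ports, int ports[][MAX_PORTS])
{
    int r, p, q;

    for (r = 0; r < n_prings; ++r) {
        for (p = 0; p < n_ports; ++p) {
            int peer = ports[r][p];
            int found = 0;

            if (peer == -1)
                continue;                    /* unused port */
            if (peer < 1 || peer > n_prings)
                return 0;                    /* illegal pRing ID */
            for (q = 0; q < n_ports; ++q)
                if (ports[peer - 1][q] == r + 1)
                    found = 1;               /* peer points back at us */
            if (!found)
                return 0;                    /* one-sided link */
        }
    }
    return 1;
}
```

For the three pRing rows of Figure C.2 (port lists -1 2 -1 -1 -1, 1 3 -1 -1 -1 and 2 -1 -1 -1 -1) the check passes; removing the entry for pRing 1 from pRing 2's row makes it fail.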
C.2.3 Program file

Every pRing has its own control program to carry out the tasks assigned to it.
In a real system the control programs are provided by the host computer during system
initialization. During simulation, you tell SimM where to find the control program for every pRing.
The control program should be compiled to MNR machine language in standard Intel Hex
format. You put the information about control programs in a program file to tell SimM where
to find them. The program file begins with the keyword PROGRAM and ends with the
keyword PROGRAM_END. Every individual pRing control program filename has its own
line in the program file between the keyword PRING_ID and the keyword END. The number
following PRING_ID is the ID of the pRing to which the control program
belongs. Different pRings can use the same control program by putting more than
one PRING_ID line before the END line. Figure C.4 shows the program file for a 3-pRing
MNR. pRing 0 is designated to be the host computer or the interface to the host computer.
There is no control program for pRing 0 since, at present, we do not consider the problem of
interfacing with the host computer. Eventually we will take that into consideration, at which
point all you will need to do is to put the host (or host interface) control program in the
program file.



PROGRAM
PRING_ID 0
END
PRING_ID 1
hopa1.hex
END
PRING_ID 2
hopa2.hex
END
PRING_ID 3
hopa3.hex
END
PROGRAM_END



Figure C.4 Program file 
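Because the control programs named in the program file are Intel Hex files, a loader would normally validate each record before using it. The sketch below checks one record using the standard Intel Hex rule that all bytes of a record, including the checksum, sum to zero modulo 256. The function names are illustrative; how SimM itself reads the files is not described in the text.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Value of one hexadecimal digit, or -1 if `c` is not a hex digit. */
static int hex_digit(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = toupper(c);
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/* Return 1 when `rec` is a well-formed Intel Hex record
 * (":LLAAAATT<data>CC") whose bytes sum to zero modulo 256. */
int ihex_record_ok(const char *rec)
{
    size_t len = strlen(rec);
    unsigned sum = 0;
    unsigned count;
    size_t i;

    /* Shortest record: ':' plus 5 bytes (LL AAAA TT CC). */
    if (len < 11 || rec[0] != ':' || (len - 1) % 2 != 0)
        return 0;
    for (i = 1; i < len; i += 2) {
        int hi = hex_digit((unsigned char)rec[i]);
        int lo = hex_digit((unsigned char)rec[i + 1]);
        if (hi < 0 || lo < 0)
            return 0;
        sum += (unsigned)(hi * 16 + lo);
    }
    /* The first byte is the data length; it must match the record size. */
    count = (unsigned)(hex_digit((unsigned char)rec[1]) * 16 +
                       hex_digit((unsigned char)rec[2]));
    if ((len - 1) / 2 != (size_t)count + 5)
        return 0;
    return (sum & 0xFFu) == 0;
}
```

For example, the end-of-file record `:00000001FF` passes the check, while `:00000001FE` does not.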

C.2.4 Neuron map file 

The neuron map file tells SimM where the neuron activation values are and
where on screen you want to see them. Neuron map files have two portions: definition and
neuron map. The definition is a single line, the first line of the neuron map file. The
definition specifies the number of neurons in the system and the threshold used to display a
neuron as active or inactive. A neuron map assigns a specific accumulator-memory location to
a specific neuron. Every line of a neuron map specifies a single neuron by giving the neuron ID,
pRing ID, PE ID, accumulator-memory address, page of screen, and the row and column on screen.
Figure C.5 shows a neuron map file with 30 neurons and a threshold of 0. The activation
value for neuron 0 is stored in pRing 1, PE 0, accumulator-memory address 3. Neuron 0 is to
be shown on the first page of the screen at row 2, column 4. The display screen of SimM
measures 48 by 16, which limits the number of neurons on a single page to 768,
but SimM allows multiple-page display to overcome this limitation. Figure C.5 shows the
neuron definitions in a neuron map file.



30 0
0 1 0 3 1 2 4
1 1 1 3 1 2 6
2 1 2 3 1 2 8
3 1 3 3 1 2 10
4 1 4 3 1 2 12
5 1 0 4 1 3 4
6 1 1 4 1 3 6
7 1 2 4 1 3 8
8 1 3 4 1 3 10
9 1 4 4 1 3 12
10 2 0 3 1 5 4
11 2 1 3 1 5 6
12 2 2 3 1 5 8
13 2 3 3 1 5 10
14 2 4 3 1 5 12
15 2 0 4 1 6 4
16 2 1 4 1 6 6
17 2 2 4 1 6 8
18 2 3 4 1 6 10
19 2 4 4 1 6 12
20 3 0 3 1 7 4
21 3 1 3 1 7 6
22 3 2 3 1 7 8
23 3 3 3 1 7 10
24 3 4 3 1 7 12
25 3 0 4 1 8 4
26 3 1 4 1 8 6
27 3 2 4 1 8 8
28 3 3 4 1 8 10
29 3 4 4 1 8 12



Figure C.5 Contents of a neuron map file 
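Each neuron map line carries seven integers in the order given in the text: neuron ID, pRing ID, PE ID, accumulator-memory address, page, row and column. A minimal parsing sketch follows; the struct and function names are illustrative and are not taken from SimM.

```c
#include <stdio.h>

/* One line of the neuron map, in the field order described in the text. */
struct neuron_map_entry {
    int neuron;     /* neuron ID                   */
    int pring;      /* pRing ID                    */
    int pe;         /* PE ID within the pRing      */
    int amem_addr;  /* accumulator-memory address  */
    int page;       /* screen page                 */
    int row;        /* screen row                  */
    int col;        /* screen column               */
};

/* Parse one neuron map line; returns 1 if all seven fields were read. */
int parse_neuron_line(const char *line, struct neuron_map_entry *e)
{
    int n = sscanf(line, "%d %d %d %d %d %d %d",
                   &e->neuron, &e->pring, &e->pe, &e->amem_addr,
                   &e->page, &e->row, &e->col);
    return n == 7;
}
```

Parsing the line `0 1 0 3 1 2 4` yields neuron 0 in pRing 1, PE 0, address 3, shown on page 1 at row 2, column 4, as described for Figure C.5.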



C.2.5 Weight matrix file

Although you can access every weight memory location in SimM, assigning an initial value
to every weight memory location by hand is a time-consuming process. SimM can be invoked
with a weight matrix as a parameter. The weights should be organized to match the requirements
embedded in the pRing control program. SimM assumes that the weight matrix file, whose
filename appears as a parameter when SimM is invoked, is correctly organized as required.

The requirements for a weight matrix file are:

1. The weights should be organized by pRing numbers.

2. The number of columns in a line should equal the number of PEs in the pRing.

3. The number of lines in a weight matrix file should equal the weight memory size
times the number of pRings.

4. The first line of a weight matrix file is assigned to the first weight memory location of
the first pRing, i.e., column 1 goes to PE 0, column 2 goes to PE 1, etc.
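The four requirements imply a simple mapping from a line of the weight matrix file to a weight memory location. The sketch below assumes that the lines for one pRing are contiguous and ordered by weight memory address, which is one reading of requirements 1 and 4. The names are illustrative and this is not SimM's loader.

```c
/* Destination of one line of the weight matrix file. */
struct weight_slot {
    int pring;    /* 1-based pRing ID                */
    int address;  /* 0-based weight memory location  */
};

/* Requirement 3: total number of lines expected in the file. */
long weight_file_lines(int wmem_size, int n_prings)
{
    return (long)wmem_size * (long)n_prings;
}

/* Map a 0-based line number to its pRing and weight memory address.
 * Column c of that line (0-based) would then go to PE c.
 * Returns 0 if the line number is out of range. */
int locate_weight_line(long line, int wmem_size, int n_prings,
                       struct weight_slot *out)
{
    if (wmem_size <= 0 || line < 0 ||
        line >= weight_file_lines(wmem_size, n_prings))
        return 0;
    out->pring = (int)(line / wmem_size) + 1;
    out->address = (int)(line % wmem_size);
    return 1;
}
```

For the sample architecture (WMEM_SIZE 61, 3 pRings) the file should hold 183 lines, and under the stated assumption line 61 (counting from 0) is the first weight memory location of pRing 2.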

A conversion program, WCNVT, that converts a weight matrix to the weight matrix file format
of SimM is also available as a supplement to SimM. WCNVT takes a conventional weight
matrix as input and outputs a SimM-format weight matrix file. Since SimM reads standard
ASCII files, you can also type the initial values of weight memory in SimM format into a
file. SimM can read that file as a weight matrix later.

Figure C.6 is part of a weight matrix file for a 3-pRing, 5-PE-per-pRing Hopfield net. 
Two patterns are stored by this Hopfield net. 



0 0 0 0 0 
2 0 2 2 0 
0 0 0 2 0 
0 2 0 0 0 
0 2 2 0 2 
0 0 2 0 0 
0 -2 2 0 -2 
-2 -2 0 0 -2 
-2 0 0 -2 -2 
-2 0 2 -2 0 



-2 2 -2 0 0 
0 0 0 0 2 
0 0 2 0 0 
0 -2 0 2 -2 
-2 -2 -2 0 0
0 0 -2 -2 -2
-2 2 0 -2 0



Figure C.6 Part of sample weight matrix file
C.3 SimM on-line commands 

After SimM is invoked, SimM sets up the MNR configuration according to your parameter
files. If successful, SimM will show you a monitor screen like Figure C.7 (the italics do
not appear on screen). Otherwise SimM responds with an error message. The screen is
divided into 5 portions: current command, error message, information window, simulation
state, and command menu. All but the command menu are message areas showing the current
status of the simulation. We will discuss commands in detail later. Table 2 shows the
contents of the message areas.



Area                 Contents
current command      Shows the latest command issued.
error message        Shows the current error encountered.
information window   Shows the current activation state of neurons defined in the neuron
                     map file and other simulation information.
simulation state     Shows the current simulation state, such as clocks, times, network
                     cycles and current page of neuron layouts.

Table 2. Message areas of SimM monitor



(current command)            SimM MONITOR            (error message)

(information window)
10101
00110
11000
00010
10011
00001

(simulation state)
Elapsed Clock     00000000
Network Cycles    00000000
Activation Sum    00000000
Elapsed Time      00:00:00
                  000.000
Current Page      01/01

(command menu)
1 Step Network   2 Run            3 pRing Stat     4 System stat  5 Step Clock
6 Step PCU inst  7 Step MCU inst  8 Step ICU inst  9 Reset        10 Exit to Dos

Figure C.7 SimM monitor mode screen layout

Figure C.8 shows all on-line commands provided by SimM. Most of SimM's on-line
commands are shown on the command menu. A few commands are not shown on the menu,
including toggle screen display (Alt-F1) under monitor mode, and modify accumulator-memory
globally (Alt-F6) and modify weight memory globally (Alt-F7) under pRing mode. Also, PgUp
and PgDn will change pages in both monitor and pRing modes. Figure C.9 shows the MNR
system information provided by SimM (i.e., the user has pressed F4).

System Status                     SimM Monitor

No. of pRings         : 3
No. of PEs per pRing  : 5
Clock cycle           : 50 ns
PE_speed              : 1 times of Clock cycle
CU_speed              : 1 times of Clock cycle
IF_speed              : 1 times of Clock cycle
Arithmetic_length     : 8 bits
Channel Bandwidth     : 320.000 (Mbits/second)
pRing Comm. Ports     : 5
Size of Wmem          : 65 words (per PE)
Size of Amem          : 16 words (per PE)

Elapsed Clock     00086868
Network Cycles    00000005
Activation sum    0005.477
Elapsed Time      00:00:00
                  004.343
Current Page      01/01

Press any key.

Free memory : 395688 bytes

1 Step Network   2 Run            3 pRing stat     4 System stat  5 Step Clock
6 Step PCU inst  7 Step MCU inst  8 Step ICU inst  9 Reset        10 Exit to Dos

Figure C.9 MNR System Status
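The second Elapsed Time value in Figure C.9 appears to be the simulated time in milliseconds, since 86868 clocks of 50 ns each come to about 4.343 ms. This interpretation is an inference from the numbers on the screen, not a statement in the text. The helper below is an illustrative sketch of that arithmetic.

```c
/* Simulated time in milliseconds for a number of system clocks.
 * Assumes the clock cycle is given in nanoseconds, as on the
 * System Status screen; elapsed_ms is a hypothetical name. */
double elapsed_ms(unsigned long clocks, double clock_ns)
{
    return (double)clocks * clock_ns / 1.0e6;
}
```

Calling `elapsed_ms(86868UL, 50.0)` gives roughly 4.3434, consistent with the 004.343 shown in the figure.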



Some commands need additional information. When you request SimM to step over an
MCU instruction (F7) in the monitor mode, SimM will ask for the ID of the indexed pRing.
The indexed pRing serves as a relative check point for the other pRings. When you ask SimM
to step over the MCU instruction of a specific pRing, SimM will treat that pRing as the
indexed pRing and execute all the pRing programs until the indexed pRing completes an MCU
instruction. Likewise, when you tell SimM that you want to see what is in accumulator-memory
(F2), SimM will ask you to specify which PE in the current pRing you want to see. SimM will
also ask for the address of accumulator-memory or weight memory when you tell
SimM you want to change their values by pressing F6 or F7 in the pRing mode. Global
changes (Alt-F6, Alt-F7) affect only the current pRing.
C.3.1 Monitor mode commands 

Screen display is relatively slow since we used the ANSI driver to maintain portability.
We suggest turning the screen display off while running large models to speed up the
simulation process. Also, you needn't enter the indexed pRing ID every time you use the
step command to debug a pRing control program, because SimM automatically repeats your last
command when you hit Enter on the keyboard. The following table describes
all commands in the monitor mode.

Keystroke  Command                 Action of SimM
F1         Step network            Step over ANN network cycles. ANN cycles are
                                   specified by the user in the pRing control program.
F2         Run                     Execute pRing control programs until all pRings
                                   finish their control program. Refresh screen whenever
                                   an ANN cycle is finished even if screen display is
                                   turned off.
F3         pRing status            Switch to pRing mode. SimM will ask for the pRing ID.
F4         System status           Shows current MNR configuration.
F5         Step clock              Execute a clock of control program for every pRing
                                   on system.
F6         Step PCU instruction    Execute until the indexed pRing finishes a PE
                                   instruction. SimM will ask for the indexed pRing ID.
F7         Step MCU instruction    Execute until the indexed pRing finishes an MCU
                                   instruction. SimM will ask for the indexed pRing ID.
F8         Step ICU instruction    Execute until the indexed pRing finishes an interface
                                   (IF) instruction. SimM will ask for the indexed pRing
                                   ID.
F9         System RESET            System reset. SimM will read the weight matrix file
                                   to initialize weight memory. SimM will also randomly
                                   assign initial values to the accumulator-memories
                                   in which the neuron activation values are stored.
F10        Exit to DOS             Confirm that you really want to exit. If you do, free
                                   all memory and return to DOS.
Alt-F1     Toggle screen display   Toggle screen display.



C.3.2 pRing mode commands

Pressing F3 (pRing status) in the monitor mode will bring you into the pRing mode.
After you answer the "pRing ID?" question, SimM changes the command menu to the pRing
mode. Figure C.10 shows the pRing mode screen layout.

The commands available in the pRing mode and their descriptions follow.

pRing Status                      pRing STATUS

Elapsed Clock   : 00086868        pRing ID = 0001
Network Cycles  :                 PC = 0088
Activation Sum  : 0005.477        SP = 0000
Elapsed Time    : 00:00:00        FLAGS = C0
                  004.343
Current Page    : 01/01           MCU idles for 295 clock cycles
PRING ID        : 0001            PCU idles for 3973 clock cycles
PE ID           : 0002            IFS idles for 86484 clock cycles
                                  IFR idles for 86484 clock cycles
                                  MCU utilization   99.66 %
                                  PCU utilization   95.43 %
                                  IFS utilization    0.44 %
                                  IFR utilization    0.44 %

1 Change pRing  2 Examine Amem  3 Examine Wmem  4 Examine Reg.  5 pRing Stat.
6 Modify Amem   7 Modify Wmem   8 Modify Reg.   9               10 main menu

Figure C.10 SimM pRing mode screen layout
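The utilization percentages in Figure C.10 are consistent with the idle counts shown beside them, taken against the elapsed clock count. The sketch below makes that relationship explicit. The formula is inferred from the numbers on the screen and the function name is illustrative, so it should not be read as SimM's own code.

```c
/* Percentage of elapsed clocks during which a unit was not idle. */
double utilization_pct(unsigned long idle_clocks, unsigned long elapsed_clocks)
{
    if (elapsed_clocks == 0 || idle_clocks > elapsed_clocks)
        return 0.0;   /* nothing meaningful to report */
    return 100.0 * (double)(elapsed_clocks - idle_clocks)
                 / (double)elapsed_clocks;
}
```

With 86868 elapsed clocks, idle counts of 295, 3973 and 86484 give about 99.66 %, 95.43 % and 0.44 %, the values shown for the MCU, PCU and IFS/IFR.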



Keystroke  Command                    Action of SimM
F1         Change pRing               Change current pRing. SimM will ask which pRing
                                      will be the current pRing.
F2         Examine accumulator-       Check accumulator-memory of a specific PE within
           memory                     current pRing. SimM will ask you for the PE to be
                                      checked and will then set that PE to be current.
F3         Examine weight             Check weight memory of a specific PE within current
           memory                     pRing. SimM will ask you for the PE to be checked and
                                      will then set that PE to be current.
F4         Examine pRing registers    Check pRing registers, including PC, SP, Flags and
                                      all 7 common registers.
F5         pRing status               Check current pRing status, including hardware
                                      utilization.
F6         Modify accumulator-        Modify accumulator-memory of current pRing and
           memory                     PE. SimM asks for address and new value.
Alt-F6     Modify accumulator-        Modify accumulator-memory for every PE of current
           memory globally            pRing. SimM asks for address and new value.
F7         Modify weight memory       Modify weight memory of current pRing and PE.
                                      SimM asks for address and new value.
Alt-F7     Modify weight memory       Modify weight memory for every PE of current
           globally                   pRing. SimM asks for address and new value.
F8         Modify pRing register      Modify register value for current pRing. SimM asks
                                      for register number and new value.
F10        Return to monitor mode     Return to monitor mode.



C.4 Limitation of SimM

There are some constraints which apply to SimM. One of them is the limitation of
memory. SimM runs under MS-DOS (PC-DOS), so SimM can't access more than 640 Kbytes
of main memory even though your machine may have several Mbytes. That limits SimM to only
200 K connections, i.e., 440 neurons fully connected. Speed is the other drawback of SimM. It
runs relatively slowly as an ANN model simulator. But our goal in designing SimM was to
implement an architectural simulation of the MNR architecture; speed was not our major
concern. Portability, on the other hand, was a major concern. SimM currently runs only on
PC-ATs or compatibles under MS-DOS. However, being written in C without using non-portable
functions, SimM is portable to any machine with a C compiler.

The designers of SimM are working on revisions to SimM. One of the goals is to
eliminate the memory limitation in the current version of SimM.
Appendix D PCU Microinstruction Mnemonic and CMDASM Listing



Mnemonic  Operand   Description
ACS                 PE accumulator select
AWE                 PE accumulator write enable
PCLK                Generate processor clock
PCS                 Parameter memory select
PWE                 Parameter memory write enable
PRMACC              Accumulate parameter
LDPRM               Load parameter accumulator
INCPRM              Increment parameter accumulator
PAD       address   Address parameter memory
OP        opcode    Generate PE board opcode
PHASE               Set PE board operation phase
INCWA               Increment weight address
INCAA               Increment PE accumulator address
LDWA      address   Load weight address
LDAA      address   Load PE accumulator address
AOE                 Enable PE accumulator memory output
IMEDDIS             Disable instruction immediate field
IMD       data      Generate immediate data
INV       data      Generate inverted immediate data
RDWM                Read weight memory
WRWM                Write weight memory
SUM2P               Select parameter accumulator for parameter input
CTR2P               Select loop counter for parameter input
EXT2P               Select external input for parameter input
OUTPRQ    prq       Output parameter request
AASEL               Select weight address for accumulator address
NOP                 Continue
JMP       address   Unconditional jump
DJZ       address   Decrement loop counter & jump on zero
JPACK     address   Jump if parameter acknowledge
JPCY      address   Jump if parameter accumulator carry
DECCTR              Decrement loop counter
LDCTR     value     Load loop counter
SRRD                Shift register memory read
SRWR                Shift register memory write
SRLTCH              Latch shift register memory data
SRXD                Latch exchange data into shift register
RDPRM     address   Read parameter memory
WRPRM     address   Write parameter memory

Table C.1 Mnemonic PCU Instruction Set

OPFIELD opcodes[] = {
/* OPCODE      MASK FIELD                   CODE FIELD                   TYPE */
 { "ACS",     {0x00,0x00,0x00,0x80,0x04}, {0x00,0x00,0x00,0x00,0x04}, NO_OP },
 { "AWE",     {0x00,0x00,0x00,0x80,0x20}, {0x00,0x00,0x00,0x00,0x20}, NO_OP },
 { "PCLK",    {0x00,0x00,0x00,0x80,0x10}, {0x00,0x00,0x00,0x00,0x10}, NO_OP },
 { "PCS",     {0x00,0x00,0x00,0x80,0x05}, {0x00,0x00,0x00,0x80,0x05}, NO_OP },
 { "PWE",     {0x00,0x00,0x00,0x80,0x20}, {0x00,0x00,0x00,0x80,0x20}, NO_OP },
 { "PRMACC",  {0x00,0x00,0x00,0x80,0x18}, {0x00,0x00,0x00,0x80,0x10}, NO_OP },
 { "LDPRM",   {0x00,0x00,0x00,0x80,0x18}, {0x00,0x00,0x00,0x80,0x18}, NO_OP },
 { "INCPRM",  {0x00,0x00,0x00,0x80,0x18}, {0x00,0x00,0x00,0x80,0x08}, NO_OP },
 { "PAD",     {0x00,0x00,0x00,0x8F,0x00}, {0x00,0x00,0x00,0x80,0x00}, LOWOP },
 { "OP",      {0x00,0x00,0x00,0x8F,0x00}, {0x00,0x00,0x00,0x00,0x00}, LOWOP },
 { "PHASE",   {0x00,0x00,0x00,0x80,0x08}, {0x00,0x00,0x00,0x00,0x08}, NO_OP },
 { "INCWA",   {0x00,0x00,0x00,0x80,0x40}, {0x00,0x00,0x00,0x00,0x40}, NO_OP },
 { "INCAA",   {0x00,0x00,0x01,0x80,0x00}, {0x00,0x00,0x01,0x00,0x00}, NO_OP },
 { "LDWA",    {0x7F,0xFF,0x00,0x80,0x40}, {0x00,0x00,0x00,0x80,0x40}, HI_15 },
 { "LDAA",    {0x07,0xFF,0x01,0x80,0x00}, {0x00,0x00,0x01,0x80,0x00}, HI_11 },
 { "AOE",     {0x00,0x00,0x00,0x00,0x80}, {0x00,0x00,0x00,0x00,0x80}, NO_OP },
 { "IMEDDIS", {0x00,0x00,0x00,0x00,0x01}, {0x00,0x00,0x00,0x00,0x01}, NO_OP },
 { "IMD",     {0xFF,0xFF,0x00,0x00,0x00}, {0x00,0x00,0x00,0x00,0x00}, HI_OP },
 { "INV",     {0xFF,0xFF,0x00,0x00,0x00}, {0x00,0x00,0x00,0x00,0x00}, INVOP },
 { "RDWM",    {0x00,0x00,0x06,0x80,0x00}, {0x00,0x00,0x02,0x00,0x00}, NO_OP },
 { "WRWM",    {0x00,0x00,0x06,0x80,0x00}, {0x00,0x00,0x04,0x00,0x00}, NO_OP },
 { "SUM2P",   {0x00,0x00,0x06,0x80,0x00}, {0x00,0x00,0x00,0x80,0x00}, NO_OP },
 { "CTR2P",   {0x00,0x00,0x06,0x80,0x00}, {0x00,0x00,0x02,0x80,0x00}, NO_OP },
 { "EXT2P",   {0x00,0x00,0x04,0x80,0x00}, {0x00,0x00,0x04,0x80,0x00}, NO_OP },
 { "OUTPRQ",  {0x80,0x00,0x06,0x80,0x00}, {0x00,0x00,0x06,0x80,0x00}, HIBIT },
 { "AASEL",   {0x00,0x00,0x00,0x00,0x02}, {0x00,0x00,0x00,0x00,0x02}, NO_OP },
 { "NOP",     {0x00,0x00,0x00,0x00,0x00}, {0x00,0x00,0x00,0x00,0x00}, NO_OP },
 { "JMP",     {0x07,0xFF,0xE0,0x00,0x00}, {0x00,0x00,0x20,0x00,0x00}, HI_11 },
 { "DJZ",     {0x07,0xFF,0xE0,0x00,0x00}, {0x00,0x00,0x40,0x00,0x00}, HI_11 },
 { "JPACK",   {0x07,0xFF,0xE0,0x00,0x00}, {0x00,0x00,0x60,0x00,0x00}, HI_11 },
 { "JPCY",    {0x07,0xFF,0xE0,0x00,0x00}, {0x00,0x00,0x80,0x00,0x00}, HI_11 },
 { "DECCTR",  {0x00,0x00,0xE0,0x00,0x00}, {0x00,0x00,0x00,0x00,0x00}, NO_OP },
 { "LDCTR",   {0x07,0xFF,0xE0,0x00,0x00}, {0x00,0x00,0xE0,0x00,0x00}, INV9_14 },
 { "SRRD",    {0x00,0x00,0x00,0x70,0x00}, {0x00,0x00,0x00,0x10,0x00}, NO_OP },
 { "SRWR",    {0x00,0x00,0x00,0x70,0x00}, {0x00,0x00,0x00,0x30,0x00}, NO_OP },
 { "SRLTCH",  {0x00,0x00,0x00,0x70,0x00}, {0x00,0x00,0x00,0x50,0x00}, NO_OP },
 { "SRXD",    {0x00,0x00,0x00,0x70,0x00}, {0x00,0x00,0x00,0x70,0x00}, NO_OP },
 { "RDPRM",   {0x00,0x00,0x00,0x8F,0x05}, {0x00,0x00,0x00,0x80,0x05}, LOWOP },
 { "WRPRM",   {0x00,0x00,0x00,0x8F,0x25}, {0x00,0x00,0x00,0x80,0x25}, LOWOP }
};

Table C.2 CMDASM Instruction Definition File
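The mask and code fields of Table C.2 suggest how several mnemonics can share one 40-bit microinstruction word. The sketch below is one plausible reading, not the CMDASM source: `put_field` plays the role that `putop` seems to play in the listing (start bit and width, with bit 0 as the least significant bit of the last byte), and `merge_fields` reports a conflict when two fields claim the same bit with different values. Both the byte order and the conflict rule are assumptions.

```c
#define INS_BYTES 5                         /* 40-bit microinstruction word */
typedef unsigned char ins_word[INS_BYTES];  /* byte 0 most significant (assumed) */

/* Write the low `width` bits of `value` into `w`, starting at bit `lsb`. */
void put_field(ins_word w, long value, int lsb, int width)
{
    int i;

    for (i = 0; i < width; ++i) {
        int bit = lsb + i;
        int byte = INS_BYTES - 1 - bit / 8;
        unsigned char m = (unsigned char)(1u << (bit % 8));

        if (((unsigned long)value >> i) & 1UL)
            w[byte] |= m;
        else
            w[byte] &= (unsigned char)~m;
    }
}

/* Merge (code, mask) into the accumulated (acc_code, acc_mask).
 * Returns nonzero, leaving the accumulators untouched, if a bit
 * claimed by both masks would need two different values. */
int merge_fields(ins_word acc_code, ins_word acc_mask,
                 const ins_word code, const ins_word mask)
{
    int i;

    for (i = 0; i < INS_BYTES; ++i)
        if ((acc_mask[i] & mask[i]) & (acc_code[i] ^ code[i]))
            return 1;                        /* bitmask conflict */
    for (i = 0; i < INS_BYTES; ++i) {
        acc_code[i] |= (unsigned char)(code[i] & mask[i]);
        acc_mask[i] |= mask[i];
    }
    return 0;
}
```

Under these assumptions PAD and PWE merge cleanly, because both set bit 0x80 of the fourth byte to 1, while adding ACS afterwards is rejected because ACS requires that same bit to be 0. A LOWOP operand would then be written with `put_field(word, operand, 8, 4)`, which lands in the low nibble covered by the 0x8F mask.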



/* CMDASM (PCU microinstruction assembler) */

#include <stdio.h>
#include <dos.h>
#include <string.h>
#include <ctype.h>
#include <stdarg.h>
#include <stdlib.h>
#include "cmdasm.h"
#include "opcodes.h"

#define VERSION "3.3.07/31/91"
#define lastchar(s) ((s)[strlen(s)-1])
#define istknchr(c) (isalnum(c) || ((c)=='_'))
#define ALERT(str,arg) { fprintf(stderr,str,arg); fcloseall(); exit(-1); }
#define ARRAYSIZE(array) (sizeof(array)/sizeof(array[0]))
#define NONE 0
#define FALSE 0
#define TRUE 1
#define GETWORD 100
#define GETLINE 101
#define EOFILE -1
#define EOLINE 0

FILE *fi, *fol, *foo;
int firstobjrec = TRUE;
long getnum();
void putop(ins_word, long, int, int);
void zero_ins(ins_word);
void ins_cpy(ins_word, ins_word);
char *ins2str(ins_word);

main (argc, argv)
int argc;
char *argv[];
{
    char line[101], word[101];
    char symlab[101], numstr[101], *dummy;
    char *error;
    int i, lc, lincnt, status, adding;
    int gotsym, symnum, symcnt = 0;
    int foundop, gotcode = FALSE;
    ins_word insmsk, inswrd, tmpwrd;
    long operand;
    SYMREC slrec[1000];
#ifdef DEBUG
    testops();
#endif

    zero_ins(insmsk);
    zero_ins(inswrd);

    --argc;
    *(strchr(argv[0], '.')) = '\0';            /* strip the extension */
    argv[0] = strrchr(argv[0], '\\') + 1;      /* strip the path */
    fprintf (stderr, " %s - Version %s\n", strupr(argv[0]), VERSION);
    if ( argc<1 || argc>3 ) {
        fprintf (stderr, " Needs one to three arguments: the source ");
        fprintf (stderr, "input file, the list output\n file, and the ");
        fprintf (stderr, "object output file. A default extension of \'.");
        fprintf (stderr, "CMD\' will be\n assumed if none is provided ");
        fprintf (stderr, "for the input file. If no output filename is\n");
        fprintf (stderr, " given the default filename is the same as ");
        fprintf (stderr, "the input file. The default\n extensions are");
        fprintf (stderr, " \'.LST\' and \'.OBJ\'.\n");
        exit(-1);
    }
    openfiles (argc, argv);

lincnt - 0; 
Ic = 0; 

fprintf (stden-. • Pass 1 .Nn'); ^ ^ , 

while( (status = getstr{GETLINEJine.100.fi)) I- EOFILE) { 
+-^llncnt; 

while ( status ^ EOLINE } { 

status » getstr (GETUNE, Une. 100, fi); 
++(incnt; 

gotsym » FALSE; 
if ( isaipha(line(01) ) { 

getstr (GEtWorO. symlab. 100. fi); 

symiab(16| = ^0'; Tif string more than 16 chars, shorten*/ 
for (symnum=>0: (symnum <symcnt) && fgotsym; ++symnum) | 
If ( fetrcmpi(symlab.sirec[symnum).sym) ) { v . . i- 

fprintf (st&n-. • WARNING: Redefinition of symbol/label %. 16s* in line %d.\n-, symlab. lincnt); 
gotsym » TRUE; 

) 

strcpy (slrec{symnum).sym. symlab); 
if ( symnum ^ symcnt ) Tsymbol not found in table / 
++symcnt; 
gotsym = TRUE; 



} 



if ( llne(0] !» 'V ) { 

* getstr(GETWiORD.word. 100.fi) !» EOUNE ) { 



word(0]^« '/ ) { 
^j^j. 



If ( getstr(GETWOR0.numstr. 100.fi) EOLINE ) { 
operand s strtol(numstr. &dummy. 0); 
if ( lermo ) ( 

if ( !stranpi(word.'.ORG')) Ic » operand: 
else if ( !sircmpi(word.'.EQU') ) { 
if (gotsym) slrec(symnum).vai = operand; 
else { 

fprintf (stden-. " ERROR: No symbol in line %d.\nMincnt); 
if ( symnum »> symcnt ) -symcnt: 
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else if ( gosym . ^ 

if { symnum = symcnt ) -symcnt; 
} * 

^•^^ ^fpriS^tdirr. - ERROR: Expected vaiue but none found in line %d.\n-. lincnt); 
if ( symnum « symcnt ) -symcnt; 

            else
                for (i=0; i<ARRAYSIZE(opcodes); ++i)
                    if ( !strcmpi(word, opcodes[i].mnem) ) {
                        slrec[symnum].val = (long)lc;
        else if ( gotsym )
            fprintf (stderr, " ERROR: No opcode/command in line %d.\n", lincnt);
        if ( symnum == symcnt )
            --symcnt;


/* pass two of assembly */
rewind (fi);

lincnt = 0;
lc = 0;

lincnt 

    if ( isalpha(line[0]) )
        getstr (GETWORD, word, 100, fi);
    if ( status == EOLINE )
        fprintf (fol, " :%s\n", line);
    else {
        adding = (line[0] == '+');
        foundop = FALSE;
        error = NONE;
        if ( getstr(GETWORD, word, 100, fi) != EOLINE ) {

            if ( !strcmpi(word, ".ORG") && getstr(GETWORD, numstr, 100, fi) != EOLINE ) {
                operand = strtol(numstr, &dummy, 0);
                if ( !errno ) {
                    if ( gotcode ) {
                        addtoobj (lc, inswrd);
                        gotcode = FALSE;
                    }
                    lc = operand;
                    zero_ins(insmsk);
                    zero_ins(inswrd);
                }
            } else {

                for (i=0; (i<ARRAYSIZE(opcodes)) && !foundop; ++i)
                    if ( !strcmpi(word, opcodes[i].mnem) ) {
                        foundop = TRUE;
                        ins_cpy(tmpwrd, opcodes[i].code);
                        if ( opcodes[i].type != NO_OP )
                            if ( getstr(GETWORD, numstr, 100, fi) != EOLINE ) {
                                operand = getnum(numstr, slrec, symcnt, &error);
                                if ( opcodes[i].type == LOWOP )
                                    putop(tmpwrd, operand, 8, 4);
                                else if ( opcodes[i].type == HI_OP )
                                    putop(tmpwrd, operand, 24, 16);
                                else if ( opcodes[i].type == INVOP )
                                    putop(tmpwrd, -operand, 24, 16);
                                else if ( opcodes[i].type == HI_BIT )
                                    putop(tmpwrd, operand, 39, 1);
                                else if ( opcodes[i].type == HI_15 )
                                    putop(tmpwrd, operand, 24, 15);
                                else if ( opcodes[i].type == HI_11 )
                                    putop(tmpwrd, operand, 24, 11);
                                else if ( opcodes[i].type == INV9_U )
                                    putop(tmpwrd, -operand, 33, 6);
                            }
                            else
                                error = "Expected operand and found none";

                        if ( !error )
                            if ( adding ) {
                                if ( add_ins(inswrd, tmpwrd, insmsk, opcodes[i].mask) != 0 )
                                    error = "Bitmask conflict";
                                else
                                    gotcode = TRUE;
                            }
                            else {
                                if ( gotcode ) {
                                    addtoobj (lc, inswrd);
                                    ++lc;
                                }
                                gotcode = TRUE;
                                ins_cpy(insmsk, opcodes[i].mask);
                                ins_cpy(inswrd, tmpwrd);
                            }
                    }

                if ( !foundop )
                    error = "Unknown opcode";

        fprintf (fol, "%03X %s:%s\n", lc, ins2str(inswrd), line);
        if ( error ) {
            fprintf (fol, " ERROR: %s.\n\n", error);
            fprintf (stderr, " ERROR: %s in line %d.\n", error, lincnt);
        }

addtoobj (lc, inswrd);

fprintf (fol, "\n\n\n");

for (symnum=0; symnum<symcnt; ++symnum) {

if (lastt:har(slrec(symnuml.sym) J) { ^ , , , , ,^ 

        fprintf (fol, "#define %-16s 0x%lX\n", slrec[symnum].sym, slrec[symnum].val);

fcloseall();
exit (0);



openfiles (count, names)
int count;
char *names[];
{
    int dotpos;
    char *dotloc, source[101], list[101], object[101];

    strcpy (source, strupr(names[1]));
    if ( (dotloc = strchr(source, '.')) != (char *)NULL )
        dotpos = dotloc - source;
    else {
        dotpos = strlen(source);
        strcat (source, ".CMD");
    }

    if ( --count )
        strcpy (list, strupr(names[2]));
    else {
        strncpy (list, source, dotpos);
        list[dotpos] = '\0';
    }
    if ( (dotloc = strchr(list, '.')) != (char *)NULL )
        dotpos = dotloc - list;
    else {
        dotpos = strlen(list);
        strcat (list, ".LST");
    }
    if ( count == 2 )
        strcpy (object, strupr(names[3]));
    else {
        strncpy (object, list, dotpos);
        object[dotpos] = '\0';
    }
    if ( !strchr(object, '.') )
        strcat (object, ".OBJ");

    if ( (fi = fopen(source, "r")) == (FILE *)NULL ) {
        fprintf (stderr, "FATAL ERROR: Could not open %s.\n", source);
        fcloseall();
        exit(-1);
    }
    if ( (fol = fopen(list, "w")) == (FILE *)NULL ) {
        fprintf (stderr, "FATAL ERROR: Could not open %s.\n", list);
        fcloseall();
        exit(-1);
    }
    if ( (foo = fopen(object, "w")) == (FILE *)NULL ) {
        fprintf (stderr, "FATAL ERROR: Could not open %s.\n", object);
        fcloseall();
        exit(-1);
    } else {
        firstobjrec = TRUE;
    }
}
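
The file-name defaulting performed by openfiles can be sketched in isolation as follows. This is a minimal illustration only, assuming the behavior described in the usage message (a default extension is appended when the name has none, and the list and object names are derived from the preceding name up to its first '.'). The helper names with_default_ext and replace_ext are hypothetical and do not appear in the listing, and the upper-casing done by strupr is omitted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy name into out and append ext when name
   contains no '.'.  Returns 0 on success, -1 if out is too small. */
int with_default_ext(char *out, size_t size, const char *name, const char *ext)
{
    size_t need = strlen(name) + 1;

    if (strchr(name, '.') == NULL)
        need += strlen(ext);
    if (need > size)
        return -1;
    strcpy(out, name);
    if (strchr(out, '.') == NULL)
        strcat(out, ext);
    return 0;
}

/* Hypothetical helper: build a sibling file name from the part of name
   before its first '.', followed by ext.  Returns 0 on success. */
int replace_ext(char *out, size_t size, const char *name, const char *ext)
{
    const char *dot = strchr(name, '.');
    size_t base = dot ? (size_t)(dot - name) : strlen(name);

    if (base + strlen(ext) + 1 > size)
        return -1;
    memcpy(out, name, base);
    strcpy(out + base, ext);
    return 0;
}
```

Unlike the listing, which copies into fixed 101-character buffers without a length check, the sketch refuses names that do not fit.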



getstr (action, string, length)
int action;
char *string;
int length;
{
    static char line[501];
    static int l, w, len;

    switch (action) {
    case GETLINE:
        if ( fgets(line, length, fi) )
            len = strlen (line);
        else
            return (EOFILE);
        for (l=0; l<len; ++l)
            if ( line[l] == '\n' ) {
                line[l] = '\0';
                break;
            }
        strcpy (string, line);
        for (l=0; l<len; ++l)
            if ( line[l] == ';' )
                return (EOLINE);
            else if ( line[l] == '.' || istknchr(line[l]) )
                return (len);
        return (EOLINE);
    case GETWORD:
        if ( line[l] == ';' )
            return (EOLINE);
        w = 0;
        while ( l<len && w<length ) {
            if ( line[l] == '.' || istknchr(line[l]) )
                string[w++] = line[l];

else if ( feneOl— V II fineIO= ' II rmellJ-^'V i| line{l>«7 ) 
break; 

        string[w] = '\0';
        while ( l<len ) {
            if ( line[l] == '.' || istknchr(line[l]) || line[l] == ';' )
                break;
            ++l;
        }
        return (w);
    default:
        break;
addtoobj (loccnt, insword)
int loccnt;
ins_word insword;
{
    int i;

    if (firstobjrec == TRUE) firstobjrec = FALSE;

    fprintf (foo, " db 0%02XH,0%02XH", loccnt & 255, (loccnt >> 8) & 255);
    for ( i=NUMCHARS-1; i >= 0; --i )
        fprintf (foo, ",0%02XH", insword[i]);
    fprintf (foo, "\n");
}
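
The object record written by addtoobj can be illustrated with a buffer-based sketch. It assumes a 5-byte instruction word (NUMCHARS is not defined in this part of the listing, but putop addresses bit 39) and assumes the record begins with " db " as it appears in the listing. The function name format_obj_record is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

#define NUMCHARS 5   /* assumption: 40-bit instruction word stored in 5 bytes */

/* Hypothetical helper: format one object record into buf.  The location
   counter is written low byte first, then the instruction bytes from the
   last array element down to the first.  Returns the record length, or -1
   if buf is too small. */
int format_obj_record(char *buf, size_t size, unsigned loccnt,
                      const unsigned char *word)
{
    size_t used;
    int i, n;

    n = snprintf(buf, size, " db 0%02XH,0%02XH",
                 loccnt & 255u, (loccnt >> 8) & 255u);
    if (n < 0 || (size_t)n >= size)
        return -1;
    used = (size_t)n;
    for (i = NUMCHARS - 1; i >= 0; i--) {
        n = snprintf(buf + used, size - used, ",0%02XH", (unsigned)word[i]);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    n = snprintf(buf + used, size - used, "\n");
    if (n < 0 || (size_t)n >= size - used)
        return -1;
    return (int)(used + (size_t)n);
}
```

Each byte is emitted as a hexadecimal literal with a leading 0 and trailing H, the form accepted by assemblers that use a db directive.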



long getnum (string, record, count, errstr)
char *string;
SYMREC record[];
int count;
char **errstr;
{
    int i;
    long operand;
    char *dummy;

    if ( isalpha(string[0]) ) {
        string[16] = '\0';
        for (i=0; i<count; ++i) {
            if ( !strcmpi(string, record[i].sym) ) return (record[i].val);
        }
        *errstr = "No match for symbol/label (0 used instead)";
        return (0L);
    } else {
        operand = strtol(string, &dummy, 0);
        if ( errno ) {
            *errstr = "Unable to decode operand (0 used instead)";
            operand = 0L;
        }
        return (operand);
    }
}
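
A self-contained sketch of the operand resolution performed by getnum is given below. It is an illustration under stated assumptions rather than the listing's code: the SymRec layout, the helper equal_nocase (standing in for the non-standard strcmpi), and the name resolve_operand are all hypothetical. The sketch also clears errno before calling strtol and checks the end pointer, two steps the listing does not show, and it does not truncate the symbol to 16 characters.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Assumed symbol-table record: a name of up to 16 characters and a value. */
typedef struct {
    char sym[17];
    long val;
} SymRec;

/* Case-insensitive string equality (stand-in for strcmpi). */
static int equal_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Resolve text as a symbol (if it starts with a letter) or as a number in
   C notation (decimal, 0x hex, leading-0 octal).  On failure *err is set
   to a message and 0 is returned. */
long resolve_operand(const char *text, const SymRec *table, int count,
                     const char **err)
{
    char *end;
    long value;
    int i;

    if (isalpha((unsigned char)text[0])) {
        for (i = 0; i < count; i++)
            if (equal_nocase(text, table[i].sym))
                return table[i].val;
        *err = "No match for symbol/label (0 used instead)";
        return 0L;
    }
    errno = 0;                      /* strtol only sets errno on failure */
    value = strtol(text, &end, 0);
    if (errno != 0 || end == text) {
        *err = "Unable to decode operand (0 used instead)";
        return 0L;
    }
    return value;
}
```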



int add_ins( dest_wrd, src_wrd, dest_msk, src_mask )
/* routine checks for bit conflicts and, if there are none,
   then ors src arguments into dest arguments */
ins_word dest_wrd, src_wrd, dest_msk, src_mask;
{
    int i=0, bad=0;

    while ( (i < NUMCHARS) && !bad ) {
        bad = ((dest_wrd[i] ^ src_wrd[i]) & dest_msk[i] & src_mask[i]);
        i++;
    }
    if ( !bad )
        for (i=0; i<NUMCHARS; i++) {
            dest_wrd[i] |= src_wrd[i];
            dest_msk[i] |= src_mask[i];
        }
    return (bad);
}
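
The conflict check used when several microinstruction fields are packed into one instruction word can be sketched as follows. The sketch assumes that a conflict means a bit covered by both masks whose two values differ (the operator between the two words is not legible in the listing, and exclusive-or is assumed here) and that the word is 5 bytes long. The name merge_fields is hypothetical.

```c
#include <stddef.h>

#define NUMCHARS 5   /* assumption: 40-bit instruction word stored in 5 bytes */

/* Merge src into dest unless the two disagree on a bit that both masks
   claim.  Returns 1 on conflict (dest unchanged), 0 after a merge. */
int merge_fields(unsigned char *dest_wrd, const unsigned char *src_wrd,
                 unsigned char *dest_msk, const unsigned char *src_msk)
{
    size_t i;

    for (i = 0; i < NUMCHARS; i++) {
        /* differing bits, restricted to bits owned by both fields */
        if ((dest_wrd[i] ^ src_wrd[i]) & dest_msk[i] & src_msk[i])
            return 1;
    }
    for (i = 0; i < NUMCHARS; i++) {
        dest_wrd[i] |= src_wrd[i];
        dest_msk[i] |= src_msk[i];
    }
    return 0;
}
```

Under this reading, two source lines that set the same field to the same value merge cleanly, and only contradictory settings produce the "Bitmask conflict" error.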



void putop( dest_ins, operand, startbit, numbits )
ins_word dest_ins;
long operand;
int startbit, numbits;
/* routine accepts an operand and the starting bit (least significant bit
   index) and number of bits that it occupies in the instruction word.
   It puts each operand bit in the instruction word one at a time (not
   very elegant) */
{
    int bitpos, charnum, tmpchar, i;

    i = 0;                /* bit position within operand */
    bitpos = startbit;    /* bit position within ins_word */
    do {
        charnum = (WORDLENGTH - bitpos - 1) / 8;    /* array index into ins_word */
        tmpchar = 1 << ((bitpos) % 8);
        if ( operand & (1L << i) )
            dest_ins[charnum] |= tmpchar;
        else
            dest_ins[charnum] &= ~tmpchar;
    } while (++bitpos, ++i < numbits);
}
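
The bit-field placement done by putop can be checked with a small stand-alone version. The sketch assumes WORDLENGTH is 40 (the listing places a field at bit 39 but does not show the definition) and that byte 0 of the array holds the most significant bits. The name put_field is hypothetical, and a for loop replaces the listing's do/while so that a width of zero writes nothing.

```c
#define WORDLENGTH 40                 /* assumption: bits per instruction word */
#define NUMCHARS (WORDLENGTH / 8)

/* Copy the low numbits bits of operand into word, with operand bit 0
   landing on instruction bit startbit.  Instruction bit 0 is the least
   significant bit of the last byte of the array. */
void put_field(unsigned char *word, unsigned long operand,
               int startbit, int numbits)
{
    int i;

    for (i = 0; i < numbits; i++) {
        int bitpos = startbit + i;
        int idx = (WORDLENGTH - bitpos - 1) / 8;
        unsigned char bit = (unsigned char)(1u << (bitpos % 8));

        if (operand & (1UL << i))
            word[idx] |= bit;
        else
            word[idx] &= (unsigned char)~bit;
    }
}
```

With these assumptions, a 4-bit field starting at bit 8 occupies the low nibble of byte 3, and a 1-bit field at bit 39 is the top bit of byte 0.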

void zero_ins( word )
/* Sets every bit in the instruction word to a zero. Object oriented
   programming is our friend */
ins_word word;
{
    int i;

    for (i=0; i < NUMCHARS; i++)
        word[i] = 0;
}

void ins_cpy( dest_wrd, src_wrd )
/* dest_wrd = src_wrd */
ins_word dest_wrd, src_wrd;
{
    int i;

    for (i=0; i<NUMCHARS; i++)
        dest_wrd[i] = src_wrd[i];
}

char *ins2str( word )
/* Converts instruction word into an ASCII hex string. */
ins_word word;
{
    static char digits[NUMCHARS * 2 + 1];
    int i;

    for (i=0; i<NUMCHARS; i++)
        sprintf (digits + i*2, "%02X", word[i]);
    return digits;
}

#ifdef DEBUG
testops()
/* prints out the opcodes array to see if it was initialized correctly. */
{
    int i, j;

    for (j=0; j<15; j++) {
        printf ("%s\n", opcodes[j].mnem);
        for (i=0; i<NUMCHARS; i++)
            printf ("%d- %X , %X\n", i, opcodes[j].mask[i], opcodes[j].code[i]);
    }
}
#endif
Appendix E Prototype PCU Macroinstructions in C Language

MOVE
PUTA(aa, sa)                        AM <== SM
GETA(sa, aa)                        SM <== AM
PUTW(wa, sa)                        WM <== SM
GETW(sa, wa)                        SM <== WM
MOVA(aa, aa2)                       AM <== AM(2)
MOVP(sa, sa2)                       SM <== SM(2)
A2W(wa, aa)                         WM <== AM
W2A(aa, wa)                         AM <== WM
SFT(sa, w)                          Circular SM, forw PEs

Arithmetic
ADDW(aa, wa)                        AM <== AM + WM
ADDA(aa, aa2)                       AM <== AM + AM(2)
INCA(aa)                            AM <== AM + 1
ANDW(aa, wa)                        AM <== AM & WM
ANDA(aa, aa2)                       AM <== AM & AM(2)
SUBW(aa, wa)                        AM <== AM - WM
SUBA(aa, aa2)                       AM <== AM - AM(2)
DECA(aa)                            AM <== AM - 1
XORW(aa, wa)                        AM <== AM XOR WM
XORA(aa, aa2)                       AM <== AM XOR AM(2)
NOTA(aa)                            AM <== NOT AM
XMAC(aa, wa, off)                   AM <== Sum(AM + P * WM) (inner product)
MACX(aa, wa, off, sa)               AM, SM <== Sum(SM + P *
XPRDUT(wa, wa2, off, scale)         outer product
MAC(aa, wa)                         AM <== AM + P * WM
MUL(aa, wa)                         AM <== P * WM
ASL(aa, b)                          AM <== AM << b
OADDA(aa, aa2)                      AM <== AM + AM(2) with overflow checking
OSUBA(aa, aa2)                      AM <== AM - AM(2) with overflow checking
OSUBW(aa, wa)                       AM <== AM - WM with overflow checking
OADDW(aa, wa)                       AM <== AM + WM with overflow checking
SGEXT(aa, nbits)                    Sign extend (Np bits to nbits bits)
ADDWW(wa, wa2, nwbits, oldbits)     WM <== WM(nwbits) + WM2(oldbits)

Conditional
CLA(aa)                             Clear AM
CLW(wa)                             Clear WM
CLAZ(aa)                            Clear AM if Zero
CLAC(aa)                            Clear AM if Carry
CLAS(aa)                            Clear AM if Sign
CLAN(aa)                            Clear AM if Negative
CLANZ(aa)                           Clear AM if Not Zero
CLANC(aa)                           Clear AM if Not Carry
CLANS(aa)                           Clear AM if Not Sign
CLAP(aa)                            Clear AM if Positive
W2A(aa, wa)                         WM to AM
W2AZ(aa, wa)                        WM to AM if Zero
W2AC(aa, wa)                        WM to AM if Carry
W2AS(aa, wa)                        WM to AM if Sign
W2AN(aa, wa)                        WM to AM if Negative
W2ANZ(aa, wa)                       WM to AM if Not Zero
W2ANC(aa, wa)                       WM to AM if Not Carry
W2ANS(aa, wa)                       WM to AM if Not Sign
W2AP(aa, wa)                        WM to AM if Positive

Set Parameter
SETC                                Set Carry
CLRC                                Clear Carry
SETZ                                Set Zero
SETK(x)                             Set # of PE, K = x
SETLEN(x)                           Set precision of result of mul., len = x
SETNW(x)                            Set precision of WM, Nw = x
SETNP(x)                            Set precision, Np = x

MISC
LIMX(aa, nbits, oldbits)            limit AM from oldbits to nbits (oldbits>=nbits)
ROUND(aa)                           round @aa (for Np bits)
Appendix F Sample PCU Microinstruction Subroutine

tempWA: .EQU 0
X:      .EQU 0          ;just a dummy
PB:     .EQU 0x1000     ;DEFAULT PEs/pRING 8 0xefff
Sboard: .EQU 0          ;SA==AA
pbits:  .EQU 0x2000     ;Np 16 0xefff
wbits:  .EQU 0x1e00     ;Nw-1 15 0xf1ff
tlen:   .EQU 0x4000     ;Np+Nw 32 0xdfff
GUARD:  .EQU 6          ;guard bits = 6

; Define Parameter addresses

parm5:  .EQU 8
parm6:  .EQU 9
len:    .EQU 10
Nw:     .EQU 11
Npsave: .EQU 12
ksave:  .EQU 13
tempAA: .EQU 14
retx:   .EQU 15
parm4:  .EQU 7
parm3:  .EQU 3
parm2:  .EQU 2
parm1:  .EQU 1
istart: .EQU 0
offset: .EQU 3          ;parm3
WA:     .EQU 2          ;parm2
tarAA:  .EQU 1          ;parm1
lc:     .EQU 4
Np:     .EQU 5
SA:     .EQU 6

; ADDWW @w1, @w2

addww:  rdprm parm1     ;move first addend from weight
        ldwa X          ; to accum tmp location
      + rdprm tempAA    ; (tmp-1 actually)
        ldaa X
      + rdprm Npsave    ;load precision
        ldctr X         ; to counter
        jmp adww2       ;middle of move W -> Atmp loop
      + rdwm            ;prefetch weight memory
adww1:  op 0            ;AD <- Y
      + acs             ;write data to accum
      + awe
      + djz adww3       ;move complete -> exit lp
        incwa           ;increment W address
      + rdwm            ;prefetch nxt wt bit
adww2:  rdwm            ;fetch bit from wt mem
      + op 4            ;Y <- WI
      + phase
      + pclk
      + incaa           ;increment AA address
      + jmp adww1       ;jmp to write phase
adww3:  rdprm parm1     ;get adr of other addend
        ldwa X          ; in wt mem
      + rdprm tempAA    ;dest & other addend now in Atmp
        ldaa X
      + rdprm Npsave
        ldctr X         ;load # of bits
      + op 0            ;init carry
        pclk
        rdwm            ;prefetch weight data
      + jmp adww5       ;middle of add loop
adww4:  op 10           ;AD <- sum (standard add)
        pclk            ;& update flags
      + acs             ;write result to accum
      + awe
      + djz adww6       ;when add done exit to move result
      + incwa           ;increment to next wt
      + rdwm            ;prefetch wt value
adww5:  incaa           ;preinc accum addr (inc then load)
      + rdwm            ;fetch addends (weight)
        acs             ; (and accumulator)
      + aoe
      + op 9            ;WO <- WI, Y <- AD (load)
      + phase           ;load phase
        pclk            ;latch WO, Y
      + jmp adww4       ;go to operate phase
adww6:  rdprm parm1     ;move result to 1st opn
        ldwa X          ; from accum tmp location
        rdprm tempAA    ; (tmp-1 actually)
        ldaa X
        rdprm Npsave    ;load precision
        ldctr X         ; to counter
adww7:  incaa           ;pre-inc Atmp address
        acs             ;read Amem
      + aoe
        op 0            ;Y <- AD
      + phase           ; (load phase)
      + pclk
        op 6            ;WO <- Y
      + wrwm            ;write it wt mem
      + djz setq        ;cleanup after move (sign->Q)
        incwa           ;else inc to nxt wt mem addr
      + jmp adww7       ; loop for nxt bit

setq:   nop
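
The add loop of the ADDWW subroutine (labels adww4 and adww5) processes one bit per pass, with a carry that is initialized before the loop and updated on each add. A C model of that behavior might look like the following. It is an interpretation of the microcode comments only: it assumes one bit per memory address with the least significant bit first, and it does not model the move loops, the phase and clock signals, or the PE hardware. The name bit_serial_add is hypothetical.

```c
/* Add two nbits-wide numbers stored one bit per array element, least
   significant bit first.  Writes the sum bits to sum and returns the
   final carry. */
int bit_serial_add(const unsigned char *a, const unsigned char *b,
                   unsigned char *sum, int nbits)
{
    int carry = 0;                       /* "init carry" */
    int i;

    for (i = 0; i < nbits; i++) {        /* counter loaded with # of bits */
        int s = (a[i] & 1) + (b[i] & 1) + carry;

        sum[i] = (unsigned char)(s & 1); /* sum bit written back */
        carry = s >> 1;                  /* carry kept for the next bit */
    }
    return carry;
}
```

In this model the run time grows linearly with the operand precision, which is consistent with the subroutine loading the precision into a loop counter before each pass over the operands.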
WHAT IS CLAIMED IS:

1. A system for processing digital data, comprising:

a plurality of processing elements arranged in at
least two interconnected processor arrays disposed about a
communications network;

means for delivering data to said plurality of
processing elements; and

means for implementing processing of said data
synchronously and parallely by said plurality of
processing elements within each of said interconnected
processor arrays.